Thread: Write Ahead Logging for Hash Indexes

Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
$SUBJECT will make hash indexes reliable and usable on standby.
AFAIU, currently hash indexes are not recommended to be used in
production mainly because they are not crash-safe and with this patch,
I hope we can address that limitation and recommend them for use in
production.

This patch is built on my earlier patch [1] of making hash indexes
concurrent.  The main reason for doing so is that the earlier patch
allows the split operation to be completed and uses light-weight
locking, due to which operations can be logged at a granular level.

WAL for different operations:

This has been explained in the README as well, but I am writing it
again here for everyone's convenience.
=================================

Multiple WAL records are written for the create index operation: first
one for initializing the metapage, followed by one for each new
bucket created during the operation, followed by one for initializing
the bitmap page.  If the system crashes after any of these operations,
the whole operation is rolled back.  I considered writing a single WAL
record for the whole operation, but for that we would need to limit the
number of initial buckets that can be created during the operation.  As
we can log only a fixed number of pages, XLR_MAX_BLOCK_ID (32), with the
current XLog machinery, it is better to write multiple WAL records for
this operation.  The downside of restricting the number of buckets is
that we would need to perform a split operation if the number of tuples
is more than what can be accommodated in the initial set of buckets, and
it is not unusual to have a large number of tuples during a create index
operation.

Ordinary item insertions (that don't force a page split or need a new
overflow page) are single WAL entries.  They touch a single bucket
page and the metapage; the metapage is updated during replay just as it
is updated during the original operation.
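
To make that concrete, here is a minimal sketch (not the patch's actual
code) of how such a single-record insert could be logged; the record name
XLOG_HASH_INSERT, the struct xl_hash_insert and the local variable names
(buf, metabuf, itup, itup_off) are illustrative assumptions, while
XLogBeginInsert(), XLogRegisterBuffer(), XLogRegisterBufData(),
XLogInsert() and PageSetLSN() are the existing core WAL routines:

    START_CRIT_SECTION();

    /* tuple has been added to the bucket page, metapage tuple count bumped */
    MarkBufferDirty(buf);
    MarkBufferDirty(metabuf);

    if (RelationNeedsWAL(rel))
    {
        xl_hash_insert xlrec;       /* hypothetical WAL record struct */
        XLogRecPtr  recptr;

        xlrec.offnum = itup_off;    /* where the tuple was placed */

        XLogBeginInsert();
        XLogRegisterData((char *) &xlrec, sizeof(xl_hash_insert));

        /* block 0: bucket (or overflow) page that received the tuple */
        XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
        XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));

        /* block 1: metapage, so replay can bump the tuple count as well */
        XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);

        recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INSERT);

        PageSetLSN(BufferGetPage(buf), recptr);
        PageSetLSN(BufferGetPage(metabuf), recptr);
    }

    END_CRIT_SECTION();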

An insertion that causes the addition of an overflow page is logged as
a single WAL entry, preceded by a WAL entry for the new overflow page
required to insert the tuple.  There is a corner case where, by the time
we try to use the newly allocated overflow page, it has already been used
by concurrent insertions; in that case, a new overflow page will be
allocated and a separate WAL entry will be made for it.

An insertion that causes a bucket split is logged as a single WAL
entry, followed by a WAL entry for allocating a new bucket, followed
by a WAL entry for each overflow bucket page in the new bucket to
which tuples are moved from the old bucket, followed by a WAL entry to
indicate that the split is complete for both the old and new buckets.

A split operation that requires overflow pages to complete will need to
write a WAL record for each new allocation of an overflow page.  As
splitting involves multiple atomic actions, it is possible that the
system crashes while moving tuples from the pages of the old bucket to
the new bucket.  After recovery, both the old and new buckets will be
marked with the incomplete-split flag.  The reader algorithm works
correctly, as it will scan both the old and new buckets.

We finish the split at the next insert or split operation on the old
bucket.  It could be done during searches, too, but it seems best not to
put any extra updates in what would otherwise be a read-only operation
(updating is not possible in hot standby mode anyway).  It would seem
natural to complete the split in VACUUM, but since splitting a bucket
might require allocating a new page, it might fail if you run out of
disk space.  That would be bad during VACUUM - the reason for running
VACUUM in the first place might be that you ran out of disk space, and
now VACUUM won't finish because you're out of disk space.  In
contrast, an insertion can require enlarging the physical file anyway.

Deletion of tuples from a bucket is performed for two reasons: one is to
remove dead tuples, and the other is to remove tuples that were moved by
a split.  A WAL entry is made for each bucket page from which tuples are
removed, followed by a WAL entry to clear the garbage flag if the tuples
moved by the split are removed.  Another separate WAL entry is made for
updating the metapage if the deletion is performed to remove dead tuples
by vacuum.

As deletion involves multiple atomic operations, it is quite possible
that the system crashes (a) after removing tuples from some of the
bucket pages, (b) before clearing the garbage flag, or (c) before
updating the metapage.  If the system crashes before completing (b), it
will again try to clean the bucket during the next vacuum or insert
after recovery, which can have some performance impact, but it will work
fine.  If the system crashes before completing (c), after recovery there
could be some additional splits until the next vacuum updates the
metapage, but the other operations like insert, delete and scan will
work correctly.  We could fix this problem by actually updating the
metapage based on the delete operation during replay, but I am not sure
it is worth the complication.

The squeeze operation moves tuples from a bucket page later in the
chain to one earlier in the chain, and writes a WAL record when either
the page to which it is writing tuples becomes full or the page from
which it is removing tuples becomes empty.

As the squeeze operation involves multiple atomic operations, it is
quite possible that the system crashes before completing the operation
on the entire bucket.  After recovery, the operations will work
correctly, but the index will remain bloated and can impact the
performance of read and insert operations until the next vacuum
squeezes the bucket completely.

=====================================


One of the challenges in writing this patch was that the current code
was not written with the mindset that we need to write WAL for the
different operations.  A typical example is _hash_addovflpage(), where
pages are modified across different function calls and all modifications
need to be done atomically, so I had to refactor some code so that the
operations can be logged sensibly.
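
As a rough illustration of what the refactoring aims for (shape only,
under the assumption that each write operation can gather its buffers up
front): every page change belonging to one logical action, plus its WAL
record, is confined to a single critical section, so replay never sees a
half-done multi-page modification.

    /*
     * All buffers involved in the action (e.g. the new overflow page, the
     * page whose next-pointer must be updated, the bitmap page and the
     * metapage) are pinned and locked before this point.
     */
    START_CRIT_SECTION();

    /* ... modify every page that belongs to this atomic action ... */
    /* ... XLogBeginInsert(), register each modified buffer, XLogInsert() ... */
    /* ... PageSetLSN() on each modified page ... */

    END_CRIT_SECTION();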

Thanks to Ashutosh Sharma, who has helped me in completing the patch by
writing WAL for the create index and delete operations and by doing
detailed testing of the patch using the pg_filedump tool.  I think it is
better if he himself explains the testing he has done to ensure the
correctness of the patch.


Thoughts?

Note - To use this patch, first apply latest version of concurrent
hash index patch [2].

[1] - https://commitfest.postgresql.org/10/647/
[2] - https://www.postgresql.org/message-id/CAA4eK1LkQ_Udism-Z2Dq6cUvjH3dB5FNFNnEzZBPsRjw0haFqA@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Tue, Aug 23, 2016 at 8:54 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> $SUBJECT will make hash indexes reliable and usable on standby.
> AFAIU, currently hash indexes are not recommended to be used in
> production mainly because they are not crash-safe and with this patch,
> I hope we can address that limitation and recommend them for use in
> production.
>
> This patch is built on my earlier patch [1] of making hash indexes
> concurrent.  The main reason for doing so is that the earlier patch
> allows to complete the split operation and used light-weight locking
> due to which operations can be logged at granular level.
>
> WAL for different operations:
>
> This has been explained in README as well, but I am again writing it
> here for the ease of people.
>
>
..
> One of the challenge in writing this patch was that the current code
> was not written with a mindset that we need to write WAL for different
> operations.  Typical example is _hash_addovflpage() where pages are
> modified across different function calls and all modifications needs
> to be done atomically, so I have to refactor some code so that the
> operations can be logged sensibly.
>

This patch has not yet done the handling for old snapshots.  Previously,
we didn't do TestForOldSnapshot() checks in hash indexes as they were
not WAL-logged, but now with this patch it makes sense to perform such
checks.
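
For what it's worth, a minimal sketch of what such a check could look
like in the scan code (exactly where it belongs in the hash scan path is
an assumption; TestForOldSnapshot() itself is the existing bufmgr API
that the other index AMs use):

    Page    page = BufferGetPage(buf);

    /* raise "snapshot too old" if the page was pruned past our snapshot */
    TestForOldSnapshot(scan->xs_snapshot, rel, page);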

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Write Ahead Logging for Hash Indexes

From
Ashutosh Sharma
Date:
Hi All,

Following are the steps that I have followed to verify the WAL logging
of hash indexes.

1. I used Mithun's patch to improve coverage of the hash index code [1]
to verify the WAL logging of hash indexes. First, I confirmed, using
this patch, whether all the XLOG records associated with hash indexes
are being covered. If any XLOG record for a hash index operation was not
being covered, I added a testcase for it. I found that one of the XLOG
records, 'XLOG_HASH_MOVE_PAGE_CONTENTS', was not being covered and added
a small testcase for it. The patch for this is available at [2].

2. I executed the regression test suite and found all the hash indexes
that get created as part of the regression test suite using the query
below.
   SELECT t.relname index_name, t.oid FROM pg_class t JOIN pg_am idx
   ON idx.oid = t.relam WHERE idx.amname = 'hash';

3. I calculated the number of pages associated with each hash index and
compared every page of the hash index on the master and standby server
using the pg_filedump tool.  For example, if the number of pages
associated with 'con_hash_index' is 10, then here is what I did:

On master:
-----------------
select pg_relation_filepath('con_hash_index');
 pg_relation_filepath
-----------------------
 base/16408/16433
(1 row)

./pg_filedump -if -R 0 9
/home/edb/git-clone-postgresql/postgresql/TMP/postgres/master/base/16408/16433
> /tmp/file1

On Slave:
---------------
select pg_relation_filepath('con_hash_index');
 pg_relation_filepath
-----------------------
 base/16408/16433
(1 row)

./pg_filedump -if -R 0 9
/home/edb/git-clone-postgresql/postgresql/TMP/postgres/standby/base/16408/16433
> /tmp/file2

Compared file1 and file2 using a diff tool.

Following are the list of hash indexes that got created inside
regression database when regression test suite was executed on a
master server.

hash_i4_index
hash_name_index
hash_txt_index
hash_f8_index
con_hash_index
hash_idx

In short, this is all I did, and I found no issues during testing.
Please let me know if you need any further details.

I would like to thank Amit for his support and guidance during the
testing phase.

[1] - https://www.postgresql.org/message-id/CAA4eK1JOBX%3DYU33631Qh-XivYXtPSALh514%2BjR8XeD7v%2BK3r_Q%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAE9k0PkNjryhSiG53mjnKFhi%2BMipJMjSa%3DYkH-UeW3bfr1HPJQ%40mail.gmail.com

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com



Re: Write Ahead Logging for Hash Indexes

From
Jeff Janes
Date:
Hi Amit,

Thanks for working on this.

When building with --enable-cassert, I get a compiler warning:

hash.c: In function 'hashbucketcleanup':
hash.c:722: warning: 'new_bucket' may be used uninitialized in this function


After an intentionally created crash, I get an Assert triggering:

TRAP: FailedAssertion("!(((freep)[(bitmapbit)/32] & (1<<((bitmapbit)%32))))", File: "hashovfl.c", Line: 553)

freep[0] is zero and bitmapbit is 16.

With this backtrace:

(gdb) bt
#0  0x0000003838c325e5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003838c33dc5 in abort () at abort.c:92
#2  0x000000000081a8fd in ExceptionalCondition (conditionName=<value optimized out>, errorType=<value optimized out>, fileName=<value optimized out>,
    lineNumber=<value optimized out>) at assert.c:54
#3  0x00000000004a4199 in _hash_freeovflpage (rel=0x7f3f745d86b8, bucketbuf=198, ovflbuf=199, wbuf=198, itups=0x7ffc258fa090, itup_offsets=0x126e8a8,
    tups_size=0x7ffc258f93d0, nitups=70, bstrategy=0x12ba320) at hashovfl.c:553
#4  0x00000000004a4c32 in _hash_squeezebucket (rel=<value optimized out>, bucket=38, bucket_blkno=56, bucket_buf=198, bstrategy=0x12ba320)
    at hashovfl.c:1010
#5  0x00000000004a042a in hashbucketcleanup (rel=0x7f3f745d86b8, bucket_buf=198, bucket_blkno=56, bstrategy=0x12ba320, maxbucket=96, highmask=127,
    lowmask=63, tuples_removed=0x7ffc258fc1c8, num_index_tuples=0x7ffc258fc1c0, bucket_has_garbage=0 '\000', delay=1 '\001',
    callback=0x5e9bd0 <lazy_tid_reaped>, callback_state=0x126e248) at hash.c:937
#6  0x00000000004a07e7 in hashbulkdelete (info=0x7ffc258fc2b0, stats=0x0, callback=0x5e9bd0 <lazy_tid_reaped>, callback_state=0x126e248) at hash.c:580
#7  0x00000000005e98c5 in lazy_vacuum_index (indrel=0x7f3f745d86b8, stats=0x126ecc0, vacrelstats=0x126e248) at vacuumlazy.c:1599
#8  0x00000000005ea7f9 in lazy_scan_heap (onerel=<value optimized out>, options=<value optimized out>, params=0x12ba290, bstrategy=<value optimized out>)
    at vacuumlazy.c:1291
#9  lazy_vacuum_rel (onerel=<value optimized out>, options=<value optimized out>, params=0x12ba290, bstrategy=<value optimized out>) at vacuumlazy.c:255
#10 0x00000000005e8939 in vacuum_rel (relid=17329, relation=0x7ffc258fcbd0, options=99, params=0x12ba290) at vacuum.c:1399
#11 0x00000000005e8d01 in vacuum (options=99, relation=0x7ffc258fcbd0, relid=<value optimized out>, params=0x12ba290, va_cols=0x0,
    bstrategy=<value optimized out>, isTopLevel=1 '\001') at vacuum.c:307
#12 0x00000000006a07f1 in autovacuum_do_vac_analyze () at autovacuum.c:2823
#13 do_autovacuum () at autovacuum.c:2341
#14 0x00000000006a0f9c in AutoVacWorkerMain (argc=<value optimized out>, argv=<value optimized out>) at autovacuum.c:1656
#15 0x00000000006a1116 in StartAutoVacWorker () at autovacuum.c:1461
#16 0x00000000006afb00 in StartAutovacuumWorker (postgres_signal_arg=<value optimized out>) at postmaster.c:5323
#17 sigusr1_handler (postgres_signal_arg=<value optimized out>) at postmaster.c:5009
#18 <signal handler called>
#19 0x0000003838ce1503 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:82
#20 0x00000000006b0ec0 in ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1657
#21 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1301
#22 0x0000000000632e88 in main (argc=4, argv=0x11f4d50) at main.c:228

Cheers,

Jeff

Re: Write Ahead Logging for Hash Indexes

From
Mark Kirkwood
Date:
On 23/08/16 15:24, Amit Kapila wrote:
>
> Thoughts?
>
> Note - To use this patch, first apply latest version of concurrent
> hash index patch [2].
>
> [1] - https://commitfest.postgresql.org/10/647/
> [2] - https://www.postgresql.org/message-id/CAA4eK1LkQ_Udism-Z2Dq6cUvjH3dB5FNFNnEzZBPsRjw0haFqA@mail.gmail.com
>
>

Firstly - really nice! Patches applied easily etc to latest version 10 
checkout.

I thought I'd test by initializing pgbench schema, adding a standby, 
then adding some hash indexes and running pgbench:

Looking on the standby:

bench=# \d pgbench_accounts
   Table "public.pgbench_accounts"
  Column  |     Type      | Modifiers
----------+---------------+-----------
 aid      | integer       | not null
 bid      | integer       |
 abalance | integer       |
 filler   | character(84) |
Indexes:
    "pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
    "pgbench_accounts_bid" hash (bid)    <====

bench=# \d pgbench_history
          Table "public.pgbench_history"
 Column |            Type             | Modifiers
--------+-----------------------------+-----------
 tid    | integer                     |
 bid    | integer                     |
 aid    | integer                     |
 delta  | integer                     |
 mtime  | timestamp without time zone |
 filler | character(22)               |
Indexes:
    "pgbench_history_bid" hash (bid)     <=====


they have been replicated there ok.

However I'm seeing a hang on the master after a while:

bench=# SELECT datname,application_name,state,now()-xact_start AS
wait,query FROM pg_stat_activity ORDER BY wait DESC;
 datname | application_name | state  |      wait       | query
---------+------------------+--------+-----------------+----------------------------------------------------------------------------------------------------------------
         | walreceiver      | idle   |                 |
 bench   | pgbench          | active | 00:31:38.367467 | INSERT INTO
pgbench_history (tid, bid, aid, delta, mtime) VALUES (921, 38, 251973,
-3868, CURRENT_TIMESTAMP);
 bench   | pgbench          | active | 00:31:38.215378 | INSERT INTO
pgbench_history (tid, bid, aid, delta, mtime) VALUES (280, 95, 3954814,
2091, CURRENT_TIMESTAMP);
 bench   | pgbench          | active | 00:31:35.991056 | INSERT INTO
pgbench_history (tid, bid, aid, delta, mtime) VALUES (447, 33, 8355237,
3438, CURRENT_TIMESTAMP);
 bench   | pgbench          | active | 00:31:35.619798 | INSERT INTO
pgbench_history (tid, bid, aid, delta, mtime) VALUES (134, 16, 4839994,
-2443, CURRENT_TIMESTAMP);
 bench   | pgbench          | active | 00:31:35.544196 | INSERT INTO
pgbench_history (tid, bid, aid, delta, mtime) VALUES (37, 73, 9620119,
4053, CURRENT_TIMESTAMP);
 bench   | pgbench          | active | 00:31:35.334504 | UPDATE
pgbench_branches SET bbalance = bbalance + -2954 WHERE bid = 33;
 bench   | pgbench          | active | 00:31:35.234112 | UPDATE
pgbench_branches SET bbalance = bbalance + -713 WHERE bid = 38;
 bench   | pgbench          | active | 00:31:34.434676 | UPDATE
pgbench_branches SET bbalance = bbalance + -921 WHERE bid = 33;
 bench   | psql             | active | 00:00:00        | SELECT
datname,application_name,state,now()-xact_start AS wait,query FROM
pg_stat_activity ORDER BY wait DESC;
(10 rows)

but no errors in the logs, any thoughts?

Cheers

Mark








Re: Write Ahead Logging for Hash Indexes

From
Mark Kirkwood
Date:
On 24/08/16 12:09, Mark Kirkwood wrote:
> On 23/08/16 15:24, Amit Kapila wrote:
>>
>> Thoughts?
>>
>> Note - To use this patch, first apply latest version of concurrent
>> hash index patch [2].
>>
>> [1] - https://commitfest.postgresql.org/10/647/
>> [2] - 
>> https://www.postgresql.org/message-id/CAA4eK1LkQ_Udism-Z2Dq6cUvjH3dB5FNFNnEzZBPsRjw0haFqA@mail.gmail.com
>>
>>
>
> Firstly - really nice! Patches applied easily etc to latest version 10 
> checkout.
>
> I thought I'd test by initializing pgbench schema, adding a standby, 
> then adding some hash indexes and running pgbench:
>
> Looking on the standby:
>
> bench=# \d pgbench_accounts
>    Table "public.pgbench_accounts"
>   Column  |     Type      | Modifiers
> ----------+---------------+-----------
>  aid      | integer       | not null
>  bid      | integer       |
>  abalance | integer       |
>  filler   | character(84) |
> Indexes:
>     "pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
>     "pgbench_accounts_bid" hash (bid)    <====
>
> bench=# \d pgbench_history
>           Table "public.pgbench_history"
>  Column |            Type             | Modifiers
> --------+-----------------------------+-----------
>  tid    | integer                     |
>  bid    | integer                     |
>  aid    | integer                     |
>  delta  | integer                     |
>  mtime  | timestamp without time zone |
>  filler | character(22)               |
> Indexes:
>     "pgbench_history_bid" hash (bid)     <=====
>
>
> they have been replicated there ok.
>
> However I'm seeing a hang on the master after a while:
>
> bench=# SELECT datname,application_name,state,now()-xact_start AS 
> wait,query FROM pg_stat_activity ORDER BY wait DESC;
>  datname | application_name | state  |      wait | query
>
---------+------------------+--------+-----------------+----------------------------------------------------------------------------------------------------------------

>
>          | walreceiver      | idle   |                 |
>  bench   | pgbench          | active | 00:31:38.367467 | INSERT INTO 
> pgbench_history (tid, bid, aid, delta, mtime) VALUES (921, 38, 251973, 
> -3868, CURRENT_TIMESTAMP);
>  bench   | pgbench          | active | 00:31:38.215378 | INSERT INTO 
> pgbench_history (tid, bid, aid, delta, mtime) VALUES (280, 95, 
> 3954814, 2091, CURRENT_TIMESTAMP);
>  bench   | pgbench          | active | 00:31:35.991056 | INSERT INTO 
> pgbench_history (tid, bid, aid, delta, mtime) VALUES (447, 33, 
> 8355237, 3438, CURRENT_TIMESTAMP);
>  bench   | pgbench          | active | 00:31:35.619798 | INSERT INTO 
> pgbench_history (tid, bid, aid, delta, mtime) VALUES (134, 16, 
> 4839994, -2443, CURRENT_TIMESTAMP);
>  bench   | pgbench          | active | 00:31:35.544196 | INSERT INTO 
> pgbench_history (tid, bid, aid, delta, mtime) VALUES (37, 73, 9620119, 
> 4053, CURRENT_TIMESTAMP);
>  bench   | pgbench          | active | 00:31:35.334504 | UPDATE 
> pgbench_branches SET bbalance = bbalance + -2954 WHERE bid = 33;
>  bench   | pgbench          | active | 00:31:35.234112 | UPDATE 
> pgbench_branches SET bbalance = bbalance + -713 WHERE bid = 38;
>  bench   | pgbench          | active | 00:31:34.434676 | UPDATE 
> pgbench_branches SET bbalance = bbalance + -921 WHERE bid = 33;
>  bench   | psql             | active | 00:00:00        | SELECT 
> datname,application_name,state,now()-xact_start AS wait,query FROM 
> pg_stat_activity ORDER BY wait DESC;
> (10 rows)
>
> but no errors in the logs, any thoughts?


FWIW, retesting with the WAL logging patch removed (i.e. leaving just
the concurrent hash one) works fine.



Re: Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Wed, Aug 24, 2016 at 8:54 AM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:
> On 24/08/16 12:09, Mark Kirkwood wrote:
>>
>> On 23/08/16 15:24, Amit Kapila wrote:
>>>
>>>
>>> Thoughts?
>>>
>>> Note - To use this patch, first apply latest version of concurrent
>>> hash index patch [2].
>>>
>>> [1] - https://commitfest.postgresql.org/10/647/
>>> [2] -
>>> https://www.postgresql.org/message-id/CAA4eK1LkQ_Udism-Z2Dq6cUvjH3dB5FNFNnEzZBPsRjw0haFqA@mail.gmail.com
>>>
>>>
>>
>> Firstly - really nice! Patches applied easily etc to latest version 10
>> checkout.
>>
>> I thought I'd test by initializing pgbench schema, adding a standby, then
>> adding some hash indexes and running pgbench:
>>
>> Looking on the standby:
>>
>> bench=# \d pgbench_accounts
>>    Table "public.pgbench_accounts"
>>   Column  |     Type      | Modifiers
>> ----------+---------------+-----------
>>  aid      | integer       | not null
>>  bid      | integer       |
>>  abalance | integer       |
>>  filler   | character(84) |
>> Indexes:
>>     "pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
>>     "pgbench_accounts_bid" hash (bid)    <====
>>
>> bench=# \d pgbench_history
>>           Table "public.pgbench_history"
>>  Column |            Type             | Modifiers
>> --------+-----------------------------+-----------
>>  tid    | integer                     |
>>  bid    | integer                     |
>>  aid    | integer                     |
>>  delta  | integer                     |
>>  mtime  | timestamp without time zone |
>>  filler | character(22)               |
>> Indexes:
>>     "pgbench_history_bid" hash (bid)     <=====
>>
>>
>> they have been replicated there ok.
>>
>> However I'm seeing a hang on the master after a while:
>>
>> bench=# SELECT datname,application_name,state,now()-xact_start AS
>> wait,query FROM pg_stat_activity ORDER BY wait DESC;
>>  datname | application_name | state  |      wait | query
>>
>>
---------+------------------+--------+-----------------+----------------------------------------------------------------------------------------------------------------
>>          | walreceiver      | idle   |                 |
>>  bench   | pgbench          | active | 00:31:38.367467 | INSERT INTO
>> pgbench_history (tid, bid, aid, delta, mtime) VALUES (921, 38, 251973,
>> -3868, CURRENT_TIMESTAMP);
>>  bench   | pgbench          | active | 00:31:38.215378 | INSERT INTO
>> pgbench_history (tid, bid, aid, delta, mtime) VALUES (280, 95, 3954814,
>> 2091, CURRENT_TIMESTAMP);
>>  bench   | pgbench          | active | 00:31:35.991056 | INSERT INTO
>> pgbench_history (tid, bid, aid, delta, mtime) VALUES (447, 33, 8355237,
>> 3438, CURRENT_TIMESTAMP);
>>  bench   | pgbench          | active | 00:31:35.619798 | INSERT INTO
>> pgbench_history (tid, bid, aid, delta, mtime) VALUES (134, 16, 4839994,
>> -2443, CURRENT_TIMESTAMP);
>>  bench   | pgbench          | active | 00:31:35.544196 | INSERT INTO
>> pgbench_history (tid, bid, aid, delta, mtime) VALUES (37, 73, 9620119, 4053,
>> CURRENT_TIMESTAMP);
>>  bench   | pgbench          | active | 00:31:35.334504 | UPDATE
>> pgbench_branches SET bbalance = bbalance + -2954 WHERE bid = 33;
>>  bench   | pgbench          | active | 00:31:35.234112 | UPDATE
>> pgbench_branches SET bbalance = bbalance + -713 WHERE bid = 38;
>>  bench   | pgbench          | active | 00:31:34.434676 | UPDATE
>> pgbench_branches SET bbalance = bbalance + -921 WHERE bid = 33;
>>  bench   | psql             | active | 00:00:00        | SELECT
>> datname,application_name,state,now()-xact_start AS wait,query FROM
>> pg_stat_activity ORDER BY wait DESC;
>> (10 rows)
>>
>> but no errors in the logs, any thoughts?
>

Can you get the call stacks?

>
>
> FWIW, retesting with the wal logging patch removed (i.e leaving the
> concurrent hast one) works fine.
>

Okay, information noted.


Thanks for testing and showing interest in the patch.



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Write Ahead Logging for Hash Indexes

From
Mark Kirkwood
Date:
On 24/08/16 15:36, Amit Kapila wrote:
> On Wed, Aug 24, 2016 at 8:54 AM, Mark Kirkwood
> <mark.kirkwood@catalyst.net.nz> wrote:
>> On 24/08/16 12:09, Mark Kirkwood wrote:
>>> On 23/08/16 15:24, Amit Kapila wrote:
>>>>
>>>> Thoughts?
>>>>
>>>> Note - To use this patch, first apply latest version of concurrent
>>>> hash index patch [2].
>>>>
>>>> [1] - https://commitfest.postgresql.org/10/647/
>>>> [2] -
>>>> https://www.postgresql.org/message-id/CAA4eK1LkQ_Udism-Z2Dq6cUvjH3dB5FNFNnEzZBPsRjw0haFqA@mail.gmail.com
>>>>
>>>>
>>> Firstly - really nice! Patches applied easily etc to latest version 10
>>> checkout.
>>>
>>> I thought I'd test by initializing pgbench schema, adding a standby, then
>>> adding some hash indexes and running pgbench:
>>>
>>> Looking on the standby:
>>>
>>> bench=# \d pgbench_accounts
>>>     Table "public.pgbench_accounts"
>>>    Column  |     Type      | Modifiers
>>> ----------+---------------+-----------
>>>   aid      | integer       | not null
>>>   bid      | integer       |
>>>   abalance | integer       |
>>>   filler   | character(84) |
>>> Indexes:
>>>      "pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
>>>      "pgbench_accounts_bid" hash (bid)    <====
>>>
>>> bench=# \d pgbench_history
>>>            Table "public.pgbench_history"
>>>   Column |            Type             | Modifiers
>>> --------+-----------------------------+-----------
>>>   tid    | integer                     |
>>>   bid    | integer                     |
>>>   aid    | integer                     |
>>>   delta  | integer                     |
>>>   mtime  | timestamp without time zone |
>>>   filler | character(22)               |
>>> Indexes:
>>>      "pgbench_history_bid" hash (bid)     <=====
>>>
>>>
>>> they have been replicated there ok.
>>>
>>> However I'm seeing a hang on the master after a while:
>>>
>>> bench=# SELECT datname,application_name,state,now()-xact_start AS
>>> wait,query FROM pg_stat_activity ORDER BY wait DESC;
>>>   datname | application_name | state  |      wait | query
>>>
>>>
---------+------------------+--------+-----------------+----------------------------------------------------------------------------------------------------------------
>>>           | walreceiver      | idle   |                 |
>>>   bench   | pgbench          | active | 00:31:38.367467 | INSERT INTO
>>> pgbench_history (tid, bid, aid, delta, mtime) VALUES (921, 38, 251973,
>>> -3868, CURRENT_TIMESTAMP);
>>>   bench   | pgbench          | active | 00:31:38.215378 | INSERT INTO
>>> pgbench_history (tid, bid, aid, delta, mtime) VALUES (280, 95, 3954814,
>>> 2091, CURRENT_TIMESTAMP);
>>>   bench   | pgbench          | active | 00:31:35.991056 | INSERT INTO
>>> pgbench_history (tid, bid, aid, delta, mtime) VALUES (447, 33, 8355237,
>>> 3438, CURRENT_TIMESTAMP);
>>>   bench   | pgbench          | active | 00:31:35.619798 | INSERT INTO
>>> pgbench_history (tid, bid, aid, delta, mtime) VALUES (134, 16, 4839994,
>>> -2443, CURRENT_TIMESTAMP);
>>>   bench   | pgbench          | active | 00:31:35.544196 | INSERT INTO
>>> pgbench_history (tid, bid, aid, delta, mtime) VALUES (37, 73, 9620119, 4053,
>>> CURRENT_TIMESTAMP);
>>>   bench   | pgbench          | active | 00:31:35.334504 | UPDATE
>>> pgbench_branches SET bbalance = bbalance + -2954 WHERE bid = 33;
>>>   bench   | pgbench          | active | 00:31:35.234112 | UPDATE
>>> pgbench_branches SET bbalance = bbalance + -713 WHERE bid = 38;
>>>   bench   | pgbench          | active | 00:31:34.434676 | UPDATE
>>> pgbench_branches SET bbalance = bbalance + -921 WHERE bid = 33;
>>>   bench   | psql             | active | 00:00:00        | SELECT
>>> datname,application_name,state,now()-xact_start AS wait,query FROM
>>> pg_stat_activity ORDER BY wait DESC;
>>> (10 rows)
>>>
>>> but no errors in the logs, any thoughts?
> Can you get the call stacks?
>

For every stuck backend? (just double checking)...



Re: Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Wed, Aug 24, 2016 at 9:53 AM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:
> On 24/08/16 15:36, Amit Kapila wrote:
>>
>> On Wed, Aug 24, 2016 at 8:54 AM, Mark Kirkwood
>> <mark.kirkwood@catalyst.net.nz> wrote:
>>>
>>
>> Can you get the call stacks?
>>
>
> For every stuck backend? (just double checking)...

Yeah.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Write Ahead Logging for Hash Indexes

From
Mark Kirkwood
Date:
On 24/08/16 16:33, Amit Kapila wrote:
> On Wed, Aug 24, 2016 at 9:53 AM, Mark Kirkwood
> <mark.kirkwood@catalyst.net.nz> wrote:
>> On 24/08/16 15:36, Amit Kapila wrote:
>>> On Wed, Aug 24, 2016 at 8:54 AM, Mark Kirkwood
>>> <mark.kirkwood@catalyst.net.nz> wrote:
>>> Can you get the call stacks?
>>>
>> For every stuck backend? (just double checking)...
> Yeah.
>

Cool,

I managed to reproduce with a reduced workload of 4 backends, then
noticed that the traces for 3 of 'em were all the same. So I've attached
the 2 unique ones, plus noted what pg_stat_activity thought the wait
event was, in case that is useful.

Cheers

Mark


Attachment

Re: Write Ahead Logging for Hash Indexes

From
Mark Kirkwood
Date:
On 24/08/16 16:52, Mark Kirkwood wrote:
> On 24/08/16 16:33, Amit Kapila wrote:
>> On Wed, Aug 24, 2016 at 9:53 AM, Mark Kirkwood
>> <mark.kirkwood@catalyst.net.nz> wrote:
>>> On 24/08/16 15:36, Amit Kapila wrote:
>>>> On Wed, Aug 24, 2016 at 8:54 AM, Mark Kirkwood
>>>> <mark.kirkwood@catalyst.net.nz> wrote:
>>>> Can you get the call stacks?
>>>>
>>> For every stuck backend? (just double checking)...
>> Yeah.
>>
>
> Cool,
>
> I managed to reproduce with a reduced workload of 4 backends, then
> noticed that the traces for 3 of 'em were all the same. So I've
> attached the 2 unique ones, plus noted what pg_stat_activity thought
> the wait event was, in case that is useful.

...actually I was wrong there, only 2 of them were the same. So I've
attached a new log of bt's of them all.



Attachment

Re: Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Wed, Aug 24, 2016 at 2:37 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> Hi Amit,
>
> Thanks for working on this.
>
> When building with --enable-cassert, I get compiler warning:
>
> hash.c: In function 'hashbucketcleanup':
> hash.c:722: warning: 'new_bucket' may be used uninitialized in this function
>

This warning is from the concurrent index patch. I will fix it and post
the patch on that thread.

>
> After an intentionally created crash, I get an Assert triggering:
>
> TRAP: FailedAssertion("!(((freep)[(bitmapbit)/32] &
> (1<<((bitmapbit)%32))))", File: "hashovfl.c", Line: 553)
>
> freep[0] is zero and bitmapbit is 16.
>

Here what is happening is that when it tries to clear the bitmap bit,
it expects it to be set.  Now, I think the reason why it didn't find the
bit set could be that after the new overflow page was added and the bit
corresponding to it was set, you might have crashed the system, and
replay would not have set the bit.  Then, while freeing the overflow
page, it can hit the Assert as mentioned by you.  I think the problem
here could be that I am using REGBUF_STANDARD to log the bitmap page
updates, which seems to be causing the issue.  As the bitmap page
doesn't follow the standard page layout, the actual contents would have
been omitted when taking the full-page image, and then during replay the
bit would not have been set because the page doesn't need REDO.  I think
the fix here is to use REGBUF_NO_IMAGE, as we do for vm buffers.
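
As a rough sketch of that change (the block number, variable names and
the idea of carrying the bit number as buffer data are assumptions for
illustration): register the bitmap page without a full-page image, so
that replay always re-applies the bit-set instead of relying on an image
that omits the non-standard page contents:

    /* instead of: XLogRegisterBuffer(1, mapbuf, REGBUF_STANDARD); */
    XLogRegisterBuffer(1, mapbuf, REGBUF_NO_IMAGE);
    XLogRegisterBufData(1, (char *) &bitmapbit, sizeof(uint32));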

If you can send me the detailed steps for how you have produced the
problem, then I can verify after fixing whether you are seeing the
same problem or something else.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Write Ahead Logging for Hash Indexes

From
Mark Kirkwood
Date:
On 24/08/16 17:01, Mark Kirkwood wrote:
>
> ...actually I was wrong there, only 2 of them were the same. So I've
> attached a new log of bt's of them all.
>
>
>

And I can reproduce with only 1 session, figured that might be a helpful
piece of the puzzle (trace attached).



Attachment

Re: Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Wed, Aug 24, 2016 at 11:44 AM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:
> On 24/08/16 17:01, Mark Kirkwood wrote:
>>
>>
>> ...actually I was wrong there, only 2 of them were the same. So I've
>> attached a new log of bt's of them all.
>>
>>
>>
>
> And I can reproduce with only 1 session, figured that might be a helpful
> piece of the puzzle (trace attached).
>

Thanks.

I think I know the problem here.  Basically, _hash_freeovflpage() is
trying to take a lock on the buffer previous to the overflow page to
update the links, and it is quite possible that the same buffer is
already locked for moving the tuples while squeezing the bucket.  I am
working on a fix for it.

Coincidentally, Ashutosh Sharma, a colleague of mine who was also
testing this patch, found the same issue with the attached SQL script.
So we might be able to include a test case in the regression suite as
well after the fix.


--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Write Ahead Logging for Hash Indexes

From
Jeff Janes
Date:
On Tue, Aug 23, 2016 at 10:05 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Aug 24, 2016 at 2:37 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

>
> After an intentionally created crash, I get an Assert triggering:
>
> TRAP: FailedAssertion("!(((freep)[(bitmapbit)/32] &
> (1<<((bitmapbit)%32))))", File: "hashovfl.c", Line: 553)
>
> freep[0] is zero and bitmapbit is 16.
>

Here what is happening is that when it tries to clear the bitmapbit,
it expects it to be set.  Now, I think the reason for why it didn't
find the bit as set could be that after the new overflow page is added
and the bit corresponding to it is set, you might have crashed the
system and the replay would not have set the bit.  Then while freeing
the overflow page it can hit the Assert as mentioned by you.  I think
the problem here could be that I am using REGBUF_STANDARD to log the
bitmap page updates which seems to be causing the issue.  As bitmap
page doesn't follow the standard page layout, it would have omitted
the actual contents while taking full page image and then during
replay, it would not have set the bit, because page doesn't need REDO.
I think here the fix is to use REGBUF_NO_IMAGE as we use for vm
buffers.

If you can send me the detailed steps for how you have produced the
problem, then I can verify after fixing whether you are seeing the
same problem or something else.


The test is rather awkward, it might be easier to just have me test it.  But, I've attached it.

There is a patch that needs to applied and compiled (alongside your patches, of course), to inject the crashes.  A perl script which creates the schema and does the updates.  And a shell script which sets up the cluster with the appropriate parameters, and then calls the perl script in a loop.

The top of the shell script has some hard coded paths to the binaries, and to the test data directory (which is automatically deleted)

I run it like "sh do.sh >& do.err &"  

It gives two different types of assertion failures:

$ fgrep TRAP: do.err |sort|uniq -c

     21 TRAP: FailedAssertion("!(((freep)[(bitmapbit)/32] & (1<<((bitmapbit)%32))))", File: "hashovfl.c", Line: 553)
     32 TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c", Line: 2506)

The second one is related to the intentional crashes, and so is not relevant to you.

Cheers,

Jeff
Attachment

Re: Write Ahead Logging for Hash Indexes

From
Alvaro Herrera
Date:
Amit Kapila wrote:
> $SUBJECT will make hash indexes reliable and usable on standby.

Nice work.

Can you split the new xlog-related stuff to a new file, say hash_xlog.h,
instead of cramming it in hash.h?  Removing the existing #include
"xlogreader.h" from hash.h would be nice.  I volunteer for pushing any
preliminary header cleanup commits.

I think it's silly that access/hash.h is the go-to header for hash
usage, including hash_any().  Not this patch's fault, of course, just
venting.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Wed, Aug 24, 2016 at 11:46 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Amit Kapila wrote:
>> $SUBJECT will make hash indexes reliable and usable on standby.
>
> Nice work.
>
> Can you split the new xlog-related stuff to a new file, say hash_xlog.h,
> instead of cramming it in hash.h?  Removing the existing #include
> "xlogreader.h" from hash.h would be nice.  I volunteer for pushing any
> preliminary header cleanup commits.
>

So, what you are expecting here is that we create hash_xlog.h and move
necessary functions like hash_redo(), hash_desc() and hash_identify()
to it. Then do whatever else is required to build it successfully.
Once that is done, I can build my patches on top of it.  Is that
right?  If yes, I can work on it.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Write Ahead Logging for Hash Indexes

From
Alvaro Herrera
Date:
Amit Kapila wrote:
> On Wed, Aug 24, 2016 at 11:46 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:

> > Can you split the new xlog-related stuff to a new file, say hash_xlog.h,
> > instead of cramming it in hash.h?  Removing the existing #include
> > "xlogreader.h" from hash.h would be nice.  I volunteer for pushing any
> > preliminary header cleanup commits.
> 
> So, what you are expecting here is that we create hash_xlog.h and move
> necessary functions like hash_redo(), hash_desc() and hash_identify()
> to it. Then do whatever else is required to build it successfully.
> Once that is done, I can build my patches on top of it.  Is that
> right?  If yes, I can work on it.

Yes, thanks.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Thu, Aug 25, 2016 at 6:54 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Amit Kapila wrote:
>> On Wed, Aug 24, 2016 at 11:46 PM, Alvaro Herrera
>> <alvherre@2ndquadrant.com> wrote:
>
>> > Can you split the new xlog-related stuff to a new file, say hash_xlog.h,
>> > instead of cramming it in hash.h?  Removing the existing #include
>> > "xlogreader.h" from hash.h would be nice.  I volunteer for pushing any
>> > preliminary header cleanup commits.
>>
>> So, what you are expecting here is that we create hash_xlog.h and move
>> necessary functions like hash_redo(), hash_desc() and hash_identify()
>> to it. Then do whatever else is required to build it successfully.
>> Once that is done, I can build my patches on top of it.  Is that
>> right?  If yes, I can work on it.
>
> Yes, thanks.
>

How about the attached?  If you want, I think we can go one step further
and move hash_redo to a new file, hash_xlog.c, which is required for the
main patch, but we can leave that for later as well.
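
For context, the main patch would then fill that file with the real
replay logic; a rough sketch of the dispatcher it could contain (the
in-tree hash_redo() today is just a PANIC stub, and the record names
below are assumed from the patch):

    void
    hash_redo(XLogReaderState *record)
    {
        uint8   info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;

        switch (info)
        {
            case XLOG_HASH_INSERT:      /* record name assumed from the patch */
                /* hash_xlog_insert(record); -- per-record replay routine */
                break;
            /* ... one case per hash WAL record type ... */
            default:
                elog(PANIC, "hash_redo: unknown op code %u", info);
        }
    }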

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Write Ahead Logging for Hash Indexes

From
Alvaro Herrera
Date:
Amit Kapila wrote:

> How about attached?

That works; pushed.  (I removed a few #includes from the new file.)

> If you want, I think we can one step further and move hash_redo to a
> new file hash_xlog.c which is required for main patch, but we can
> leave it for later as well.

I think that can be a part of the main patch.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Tue, Aug 30, 2016 at 3:40 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Amit Kapila wrote:
>
>> How about attached?
>
> That works; pushed.

Thanks.

> (I removed a few #includes from the new file.)
>

oops, copied from hash.h and forgot to remove those.

>> If you want, I think we can one step further and move hash_redo to a
>> new file hash_xlog.c which is required for main patch, but we can
>> leave it for later as well.
>
> I think that can be a part of the main patch.
>

Okay, makes sense.  Any suggestions on splitting the main patch or in
general on design/implementation are welcome.

I will rebase the patches on the latest commit.  The current status is
that I have fixed the problems reported by Jeff Janes and Mark
Kirkwood.  Currently, we are doing stress testing of the patch using
pgbench; once that is done, I will post a new version.  I am hoping it
will be complete this week.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Wed, Aug 24, 2016 at 10:32 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Tue, Aug 23, 2016 at 10:05 PM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
>>
>> On Wed, Aug 24, 2016 at 2:37 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>
>> >
>> > After an intentionally created crash, I get an Assert triggering:
>> >
>> > TRAP: FailedAssertion("!(((freep)[(bitmapbit)/32] &
>> > (1<<((bitmapbit)%32))))", File: "hashovfl.c", Line: 553)
>> >
>> > freep[0] is zero and bitmapbit is 16.
>> >
>>
>> Here what is happening is that when it tries to clear the bitmapbit,
>> it expects it to be set.  Now, I think the reason for why it didn't
>> find the bit as set could be that after the new overflow page is added
>> and the bit corresponding to it is set, you might have crashed the
>> system and the replay would not have set the bit.  Then while freeing
>> the overflow page it can hit the Assert as mentioned by you.  I think
>> the problem here could be that I am using REGBUF_STANDARD to log the
>> bitmap page updates which seems to be causing the issue.  As bitmap
>> page doesn't follow the standard page layout, it would have omitted
>> the actual contents while taking full page image and then during
>> replay, it would not have set the bit, because page doesn't need REDO.
>> I think here the fix is to use REGBUF_NO_IMAGE as we use for vm
>> buffers.
>>
>> If you can send me the detailed steps for how you have produced the
>> problem, then I can verify after fixing whether you are seeing the
>> same problem or something else.
>
>
>
> The test is rather awkward, it might be easier to just have me test it.
>

Okay, I have fixed this issue as explained above.  Apart from that, I
have fixed another issue reported by Mark Kirkwood upthread and a few
other issues found during internal testing by Ashutosh Sharma.

The locking issue reported by Mark and Ashutosh is that, while adding an
overflow page, the patch didn't maintain the locking order it maintains
in other write operations (lock the bucket pages first and then the
metapage to perform the write operation).  I have added comments in
_hash_addovflpage() to explain the locking order used in the modified
patch.

During stress testing with pgbench using a master-standby setup, we
found an issue which indicates that the WAL replay machinery doesn't
expect completely zeroed pages (see the explanation of RBM_NORMAL mode
atop XLogReadBufferExtended).  Previously we were zeroing the overflow
page before freeing it; now I have changed it to just initialize the
page so that the page will be empty.
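
A minimal sketch of that change, with variable names assumed: rather
than zero-filling the page before freeing it, leave it initialized but
empty, e.g.

    /* keep the freed overflow page initialized (empty) rather than all zeroes */
    _hash_pageinit(BufferGetPage(ovflbuf), BufferGetPageSize(ovflbuf));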

Apart from above, I have added support for old snapshot threshold in
the hash index code.

Thanks to Ashutosh Sharma for doing the testing of the patch and
helping me in analyzing some of the above issues.

I forgot to mention in my initial mail that Robert and I had some
off-list discussions about the design of this patch, many thanks to
him for providing inputs.


--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Wed, Sep 7, 2016 at 3:28 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Okay, I have fixed this issue as explained above.  Apart from that, I
> have fixed another issue reported by Mark Kirkwood upthread and few
> other issues found during internal testing by Ashutosh Sharma.
>

Forgot to mention that you need to apply the latest concurrent hash
index patch [1] before applying this patch.


[1] - https://www.postgresql.org/message-id/CAA4eK1J6b8O4PcEPqRxNYbLVbfToNMJEEm%2Bqn0jZX31-obXrJw%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Write Ahead Logging for Hash Indexes

From
Ashutosh Sharma
Date:
> Thanks to Ashutosh Sharma for doing the testing of the patch and
> helping me in analyzing some of the above issues.

Hi All,

I would like to summarize the test-cases that i have executed for
validating WAL logging in hash index feature.

1) I have mainly run the pgbench test with a read-write workload at a
scale factor of 1000 and various client counts such as 16, 64 and 128,
for durations of 30 mins, 1 hr and 24 hrs. I executed this test on a
highly configured power2 machine with 128 cores and 512GB of RAM. I ran
the test-case both with and without a replication setup.

Please note that I have changed the schema of the pgbench tables created
during the initialisation phase.

The new schema of pgbench tables looks as shown below on both master
and standby:

postgres=# \d pgbench_accounts
   Table "public.pgbench_accounts"
  Column  |     Type      | Modifiers
----------+---------------+-----------
 aid      | integer       | not null
 bid      | integer       |
 abalance | integer       |
 filler   | character(84) |
Indexes:
    "pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
    "pgbench_accounts_bid" hash (bid)

postgres=# \d pgbench_history
          Table "public.pgbench_history"
 Column |            Type             | Modifiers
--------+-----------------------------+-----------
 tid    | integer                     |
 bid    | integer                     |
 aid    | integer                     |
 delta  | integer                     |
 mtime  | timestamp without time zone |
 filler | character(22)               |
Indexes:
    "pgbench_history_bid" hash (bid)


2) I have also made use of the following tools for analyzing the hash
index tables created on master and standby.

a) pgstattuple : Using this tool, I could verify the tuple-level
statistics, which include the index's physical length, dead tuple
percentage and other information, on both master and standby.  However,
I saw the below error message when using this tool on one of the hash
index tables on both the master and the standby server.

postgres=# SELECT * FROM pgstattuple('pgbench_history_bid');
ERROR:  index "pgbench_history_bid" contains unexpected zero page at
block 2482297

To investigate this, I made use of the pageinspect contrib module and
found that the above block contains an uninitialised page.  But this is
quite possible in a hash index, as bucket splits happen in powers of 2
and we may therefore find some uninitialised bucket pages.

b) pg_filedump : Using this tool, I could take a dump of all the hash
index tables on master and standby and compare the dump files to verify
whether there is any difference between the hash index pages on master
and standby.

In short, this is all I did to verify the patch for WAL logging in hash
indexes.

I would like to thank Amit and Robert for their valuable inputs during
the hash index testing phase.

I am also planning to verify WAL logging in hash indexes using Kuntal's
patch for WAL consistency checking once it is stabilized.

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com



Re: Write Ahead Logging for Hash Indexes

From
Mark Kirkwood
Date:
On 07/09/16 21:58, Amit Kapila wrote:

> On Wed, Aug 24, 2016 at 10:32 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> On Tue, Aug 23, 2016 at 10:05 PM, Amit Kapila <amit.kapila16@gmail.com>
>> wrote:
>>> On Wed, Aug 24, 2016 at 2:37 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>>
>>>> After an intentionally created crash, I get an Assert triggering:
>>>>
>>>> TRAP: FailedAssertion("!(((freep)[(bitmapbit)/32] &
>>>> (1<<((bitmapbit)%32))))", File: "hashovfl.c", Line: 553)
>>>>
>>>> freep[0] is zero and bitmapbit is 16.
>>>>
>>> Here what is happening is that when it tries to clear the bitmapbit,
>>> it expects it to be set.  Now, I think the reason for why it didn't
>>> find the bit as set could be that after the new overflow page is added
>>> and the bit corresponding to it is set, you might have crashed the
>>> system and the replay would not have set the bit.  Then while freeing
>>> the overflow page it can hit the Assert as mentioned by you.  I think
>>> the problem here could be that I am using REGBUF_STANDARD to log the
>>> bitmap page updates which seems to be causing the issue.  As bitmap
>>> page doesn't follow the standard page layout, it would have omitted
>>> the actual contents while taking full page image and then during
>>> replay, it would not have set the bit, because page doesn't need REDO.
>>> I think here the fix is to use REGBUF_NO_IMAGE as we use for vm
>>> buffers.
>>>
>>> If you can send me the detailed steps for how you have produced the
>>> problem, then I can verify after fixing whether you are seeing the
>>> same problem or something else.
>>
>>
>> The test is rather awkward, it might be easier to just have me test it.
>>
> Okay, I have fixed this issue as explained above.  Apart from that, I
> have fixed another issue reported by Mark Kirkwood upthread and few
> other issues found during internal testing by Ashutosh Sharma.
>
> The locking issue reported by Mark and Ashutosh is that the patch
> didn't maintain the locking order while adding overflow page as it
> maintains in other write operations (lock the bucket pages first and
> then metapage to perform the write operation).  I have added the
> comments in _hash_addovflpage() to explain the locking order used in
> modified patch.
>
> During stress testing with pgbench using master-standby setup, we
> found an issue which indicates that WAL replay machinery doesn't
> expect completely zeroed pages (See explanation of RBM_NORMAL mode
> atop XLogReadBufferExtended).  Previously before freeing the overflow
> page we were zeroing it, now I have changed it to just initialize the
> page such that the page will be empty.
>
> Apart from above, I have added support for old snapshot threshold in
> the hash index code.
>
> Thanks to Ashutosh Sharma for doing the testing of the patch and
> helping me in analyzing some of the above issues.
>
> I forgot to mention in my initial mail that Robert and I had some
> off-list discussions about the design of this patch, many thanks to
> him for providing inputs.
>
>

Repeating my tests with these new patches applied points to the hang 
issue being solved. I tested several 10 minute runs (any of which was 
enough to elicit the hang previously). I'll do some longer ones, but 
looks good!

regards

Mark



Re: Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Thu, Sep 8, 2016 at 10:02 AM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:
>
> Repeating my tests with these new patches applied points to the hang issue
> being solved. I tested several 10 minute runs (any of which was enough to
> elicit the hang previously). I'll do some longer ones, but looks good!
>

Thanks for verifying the issue and doing further testing of the patch.
It is really helpful.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Write Ahead Logging for Hash Indexes

From
Jeff Janes
Date:
On Wed, Sep 7, 2016 at 3:29 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> Thanks to Ashutosh Sharma for doing the testing of the patch and
> helping me in analyzing some of the above issues.

Hi All,

I would like to summarize the test-cases that i have executed for
validating WAL logging in hash index feature.

1) I have mainly ran the pgbench test with read-write workload at the
scale factor of 1000 and various client counts like 16, 64 and 128 for
time duration of 30 mins, 1 hr and 24 hrs. I have executed this test
on highly configured power2 machine with 128 cores and 512GB of RAM. I
ran the test-case both with and without the replication setup.

Please note that i have changed the schema of pgbench tables created
during initialisation phase.

The new schema of pgbench tables looks as shown below on both master
and standby:

postgres=# \d pgbench_accounts
   Table "public.pgbench_accounts"
  Column  |     Type      | Modifiers
----------+---------------+-----------
 aid      | integer       | not null
 bid      | integer       |
 abalance | integer       |
 filler   | character(84) |
Indexes:
    "pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
    "pgbench_accounts_bid" hash (bid)

postgres=# \d pgbench_history
          Table "public.pgbench_history"
 Column |            Type             | Modifiers
--------+-----------------------------+-----------
 tid    | integer                     |
 bid    | integer                     |
 aid    | integer                     |
 delta  | integer                     |
 mtime  | timestamp without time zone |
 filler | character(22)               |
Indexes:
    "pgbench_history_bid" hash (bid)


Hi Ashutosh,

This schema will test the maintenance of hash indexes, but it will never use hash indexes for searching, so it limits the amount of test coverage you will get.  While searching shouldn't generate novel types of WAL records (that I know of), it will generate locking and timing issues that might uncover bugs (if there are any left to uncover, of course).

I would drop the primary key on pgbench_accounts and replace it with a hash index and test it that way (except I don't have a 128 core machine at my disposal, so really I am suggesting that you do this...)

The lack of primary key and the non-uniqueness of the hash index should not be an operational problem, because the built in pgbench runs never attempt to violate the constraints anyway. 

In fact, I'd replace all of the indexes on the rest of the pgbench tables with hash indexes, too, just for additional testing.

I plan to do testing using my own testing harness after changing it to insert a lot of dummy tuples (ones with negative values in the pseudo-pk column, which are never queried by the core part of the harness) and deleting them at random intervals.  I think that none of pgbench's built in tests are likely to give the bucket splitting and squeezing code very much exercise.

Is there a way to gather statistics on how many of each type of WAL record are actually getting sent over the replication link?  The only way I can think of is to turn on WAL archiving as well as replication, then use pg_xlogdump to gather the stats.

I've run my original test for a while now and have not seen any problems.  But I realized I forgot to compile with enable-casserts, so I will have to redo it to make sure the assertion failures have been fixed.  In my original testing I did very rarely get a deadlock (or some kind of hang), and I haven't seen that again so far.  It was probably the same source as the one Mark observed, and so the same fix.

Cheers,

Jeff

Re: Write Ahead Logging for Hash Indexes

From
Mark Kirkwood
Date:
On 09/09/16 07:09, Jeff Janes wrote:

> On Wed, Sep 7, 2016 at 3:29 AM, Ashutosh Sharma <ashu.coek88@gmail.com 
> <mailto:ashu.coek88@gmail.com>> wrote:
>
>     > Thanks to Ashutosh Sharma for doing the testing of the patch and
>     > helping me in analyzing some of the above issues.
>
>     Hi All,
>
>     I would like to summarize the test-cases that i have executed for
>     validating WAL logging in hash index feature.
>
>     1) I have mainly ran the pgbench test with read-write workload at the
>     scale factor of 1000 and various client counts like 16, 64 and 128 for
>     time duration of 30 mins, 1 hr and 24 hrs. I have executed this test
>     on highly configured power2 machine with 128 cores and 512GB of RAM. I
>     ran the test-case both with and without the replication setup.
>
>     Please note that i have changed the schema of pgbench tables created
>     during initialisation phase.
>
>     The new schema of pgbench tables looks as shown below on both master
>     and standby:
>
>     postgres=# \d pgbench_accounts
>        Table "public.pgbench_accounts"
>       Column  |     Type      | Modifiers
>     ----------+---------------+-----------
>      aid      | integer       | not null
>      bid      | integer       |
>      abalance | integer       |
>      filler   | character(84) |
>     Indexes:
>         "pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
>         "pgbench_accounts_bid" hash (bid)
>
>     postgres=# \d pgbench_history
>               Table "public.pgbench_history"
>      Column |            Type             | Modifiers
>     --------+-----------------------------+-----------
>      tid    | integer                     |
>      bid    | integer                     |
>      aid    | integer                     |
>      delta  | integer                     |
>      mtime  | timestamp without time zone |
>      filler | character(22)               |
>     Indexes:
>         "pgbench_history_bid" hash (bid)
>
>
> Hi Ashutosh,
>
> This schema will test the maintenance of hash indexes, but it will 
> never use hash indexes for searching, so it limits the amount of test 
> coverage you will get.  While searching shouldn't generate novel types 
> of WAL records (that I know of), it will generate locking and timing 
> issues that might uncover bugs (if there are any left to uncover, of 
> course).
>
> I would drop the primary key on pgbench_accounts and replace it with a 
> hash index and test it that way (except I don't have a 128 core 
> machine at my disposal, so really I am suggesting that you do this...)
>
> The lack of primary key and the non-uniqueness of the hash index 
> should not be an operational problem, because the built in pgbench 
> runs never attempt to violate the constraints anyway.
>
> In fact, I'd replace all of the indexes on the rest of the pgbench 
> tables with hash indexes, too, just for additional testing.
>
> I plan to do testing using my own testing harness after changing it to 
> insert a lot of dummy tuples (ones with negative values in the 
> pseudo-pk column, which are never queried by the core part of the 
> harness) and deleting them at random intervals.  I think that none of 
> pgbench's built in tests are likely to give the bucket splitting and 
> squeezing code very much exercise.
>
> Is there a way to gather statistics on how many of each type of WAL 
> record are actually getting sent over the replication link?  The only 
> way I can think of is to turn on wal archving as well as replication, 
> then using pg_xlogdump to gather the stats.
>
> I've run my original test for a while now and have not seen any 
> problems.  But I realized I forgot to compile with enable-casserts, to 
> I will have to redo it to make sure the assertion failures have been 
> fixed.  In my original testing I did very rarely get a deadlock (or 
> some kind of hang), and I haven't seen that again so far.  It was 
> probably the same source as the one Mark observed, and so the same fix.
>
> Cheers,
>
> Jeff

Yeah, good suggestion about replacing (essentially) all the indexes with 
hash ones and testing. I did some short runs with this type of schema 
yesterday (actually to get a feel for whether hash performance vs btree 
is comparable - it does seem to be) - but longer runs with higher 
concurrency (as high as I can manage on a single socket i7 anyway) are 
probably a good plan. If Ashutosh has access to seriously large numbers 
of cores then that is even better :-)

Cheers

Mark



Re: Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Fri, Sep 9, 2016 at 12:39 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>
> I plan to do testing using my own testing harness after changing it to
> insert a lot of dummy tuples (ones with negative values in the pseudo-pk
> column, which are never queried by the core part of the harness) and
> deleting them at random intervals.  I think that none of pgbench's built in
> tests are likely to give the bucket splitting and squeezing code very much
> exercise.
>

The hash index tests [1] written by Mithun do cover part of that code,
and we have done testing by extending those tests to cover the
splitting and squeezing parts of the code.

> Is there a way to gather statistics on how many of each type of WAL record
> are actually getting sent over the replication link?  The only way I can
> think of is to turn on wal archving as well as replication, then using
> pg_xlogdump to gather the stats.
>

Sounds sensible, but what do you want to learn from the number of
each type of WAL record?  I understand it is important to cover all
the WAL records for hash indexes (and I think Ashutosh has done that
during his tests [2]), but maybe sending the same record multiple
times could further strengthen the validation.

> I've run my original test for a while now and have not seen any problems.
> But I realized I forgot to compile with enable-casserts, to I will have to
> redo it to make sure the assertion failures have been fixed.  In my original
> testing I did very rarely get a deadlock (or some kind of hang), and I
> haven't seen that again so far.  It was probably the same source as the one
> Mark observed, and so the same fix.
>

Thanks for the verification.


[1] - https://commitfest.postgresql.org/10/716/
[2] - https://www.postgresql.org/message-id/CAE9k0PkPumi4iWFuD%2BjHHkpcxn531%3DDJ8uH0dctsvF%2BdaZY6yQ%40mail.gmail.com
-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Write Ahead Logging for Hash Indexes

From
Mark Kirkwood
Date:

On 09/09/16 14:50, Mark Kirkwood wrote:
>
> Yeah, good suggestion about replacing (essentially) all the indexes 
> with hash ones and testing. I did some short runs with this type of 
> schema yesterday (actually to get a feel for if hash performance vs 
> btree was compareable - does seem tp be) - but probably longer ones 
> with higher concurrency (as high as I can manage on a single socket i7 
> anyway) is a good plan. If Ashutosh has access to seriously large 
> numbers of cores then that is even better :-)
>
>

I managed to find a slightly bigger server (used our openstack cloud to 
run a 16 cpu vm). With the schema modified as follows:

bench=# \d pgbench_accounts
   Table "public.pgbench_accounts"
  Column  |     Type      | Modifiers
----------+---------------+-----------
 aid      | integer       | not null
 bid      | integer       |
 abalance | integer       |
 filler   | character(84) |
Indexes:
    "pgbench_accounts_pkey" hash (aid)

bench=# \d pgbench_branches
   Table "public.pgbench_branches"
  Column  |     Type      | Modifiers
----------+---------------+-----------
 bid      | integer       | not null
 bbalance | integer       |
 filler   | character(88) |
Indexes:
    "pgbench_branches_pkey" hash (bid)

bench=# \d pgbench_tellers
    Table "public.pgbench_tellers"
  Column  |     Type      | Modifiers
----------+---------------+-----------
 tid      | integer       | not null
 bid      | integer       |
 tbalance | integer       |
 filler   | character(84) |
Indexes:
    "pgbench_tellers_pkey" hash (tid)

bench=# \d pgbench_history
          Table "public.pgbench_history"
 Column |            Type             | Modifiers
--------+-----------------------------+-----------
 tid    | integer                     |
 bid    | integer                     |
 aid    | integer                     |
 delta  | integer                     |
 mtime  | timestamp without time zone |
 filler | character(22)               |
Indexes:
    "pgbench_history_pkey" hash (bid)

I performed several 10 hour runs on a size 100 database using 32 and 64 
clients. For the last run I rebuilt with assertions enabled. No hangs or 
assertion failures.

regards

Mark



Re: Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Sun, Sep 11, 2016 at 4:10 AM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:
>
>
> performed several 10 hour runs on size 100 database using 32 and 64 clients.
> For the last run I rebuilt with assertions enabled. No hangs or assertion
> failures.
>

Thanks for verification.  Do you think we can do some read-only
workload benchmarking using this server?  If yes, then probably you
can use concurrent hash index patch [1] and cache the metapage patch
[2] (I think Mithun needs to rebase his patch) to do so.



[1] - https://www.postgresql.org/message-id/CAA4eK1J6b8O4PcEPqRxNYbLVbfToNMJEEm+qn0jZX31-obXrJw@mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAD__OuhJ29CeBif_fLGe4t9Vj_-cFXBwCXhjO+D_16TXbemY+g@mail.gmail.com


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Write Ahead Logging for Hash Indexes

From
Mark Kirkwood
Date:

On 11/09/16 17:01, Amit Kapila wrote:
> On Sun, Sep 11, 2016 at 4:10 AM, Mark Kirkwood
> <mark.kirkwood@catalyst.net.nz> wrote:
>>
>> performed several 10 hour runs on size 100 database using 32 and 64 clients.
>> For the last run I rebuilt with assertions enabled. No hangs or assertion
>> failures.
>>
> Thanks for verification.  Do you think we can do some read-only
> workload benchmarking using this server?  If yes, then probably you
> can use concurrent hash index patch [1] and cache the metapage patch
> [2] (I think Mithun needs to rebase his patch) to do so.
>
>
>
> [1] - https://www.postgresql.org/message-id/CAA4eK1J6b8O4PcEPqRxNYbLVbfToNMJEEm+qn0jZX31-obXrJw@mail.gmail.com
> [2] - https://www.postgresql.org/message-id/CAD__OuhJ29CeBif_fLGe4t9Vj_-cFXBwCXhjO+D_16TXbemY+g@mail.gmail.com
>
>

I can do that - are we checking for hangs/assertions or comparing 
patched vs unpatched performance (for the metapage patch)?

regards

Mark



Re: Write Ahead Logging for Hash Indexes

From
Mark Kirkwood
Date:
On 11/09/16 19:16, Mark Kirkwood wrote:

>
>
> On 11/09/16 17:01, Amit Kapila wrote:
>> ...Do you think we can do some read-only
>> workload benchmarking using this server?  If yes, then probably you
>> can use concurrent hash index patch [1] and cache the metapage patch
>> [2] (I think Mithun needs to rebase his patch) to do so.
>>
>>
>>
>> [1] - 
>> https://www.postgresql.org/message-id/CAA4eK1J6b8O4PcEPqRxNYbLVbfToNMJEEm+qn0jZX31-obXrJw@mail.gmail.com
>> [2] - 
>> https://www.postgresql.org/message-id/CAD__OuhJ29CeBif_fLGe4t9Vj_-cFXBwCXhjO+D_16TXbemY+g@mail.gmail.com
>>
>>
>
> I can do - are we checking checking for hangs/assertions or comparing 
> patched vs unpatched performance (for the metapage patch)?
>
>

So, assuming the latter - testing performance with and without the 
metapage patch:

For my 1st runs:

- cpus 16, ram 16G
- size 100, clients 32

I'm seeing no difference in performance for the read-only (-S) pgbench 
workload (with everything using hash indexes). I guess that's not that 
surprising as the db fits in RAM (1.6G and we have 16G). So I'll retry 
with a bigger dataset (I suspect size 2000 is needed).

regards

Mark




Re: Write Ahead Logging for Hash Indexes

From
Jeff Janes
Date:
On Thu, Sep 8, 2016 at 12:09 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
 
I plan to do testing using my own testing harness after changing it to insert a lot of dummy tuples (ones with negative values in the pseudo-pk column, which are never queried by the core part of the harness) and deleting them at random intervals.  I think that none of pgbench's built in tests are likely to give the bucket splitting and squeezing code very much exercise.


I've implemented this by adding lines 197 through 202 to the count.pl script.  (I'm reattaching the test case.)

Within a few minutes of testing, I start getting errors like these:

29236 UPDATE XX000 2016-09-11 17:21:25.893 PDT:ERROR:  buffer 2762 is not owned by resource owner Portal
29236 UPDATE XX000 2016-09-11 17:21:25.893 PDT:STATEMENT:  update foo set count=count+1 where index=$1


In one test, I also got an error from my test harness itself indicating tuples are transiently missing from the index, starting an hour into a test:

child abnormal exit update did not update 1 row: key 9555 updated 0E0 at count.pl line 194.\n  at count.pl line 208.
child abnormal exit update did not update 1 row: key 8870 updated 0E0 at count.pl line 194.\n  at count.pl line 208.
child abnormal exit update did not update 1 row: key 8453 updated 0E0 at count.pl line 194.\n  at count.pl line 208.

Those key values should always find exactly one row to update.

If the tuples were permanently missing from the index, I would keep getting errors on the same key values very frequently.  But I don't get that; the errors remain infrequent and are on different values each time, so I think the tuples are in the index but the scan somehow misses them, either while the bucket is being split or while it is being squeezed.

This is on a build without --enable-cassert.

Any ideas on how best to go about investigating this?

Cheers,

Jeff

Attachment

Re: Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Mon, Sep 12, 2016 at 7:00 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Thu, Sep 8, 2016 at 12:09 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>
>>
>> I plan to do testing using my own testing harness after changing it to
>> insert a lot of dummy tuples (ones with negative values in the pseudo-pk
>> column, which are never queried by the core part of the harness) and
>> deleting them at random intervals.  I think that none of pgbench's built in
>> tests are likely to give the bucket splitting and squeezing code very much
>> exercise.
>
>
>
> I've implemented this, by adding lines 197 through 202 to the count.pl
> script.  (I'm reattaching the test case)
>
> Within a few minutes of testing, I start getting Errors like these:
>
> 29236 UPDATE XX000 2016-09-11 17:21:25.893 PDT:ERROR:  buffer 2762 is not
> owned by resource owner Portal
> 29236 UPDATE XX000 2016-09-11 17:21:25.893 PDT:STATEMENT:  update foo set
> count=count+1 where index=$1
>
>
> In one test, I also got an error from my test harness itself indicating
> tuples are transiently missing from the index, starting an hour into a test:
>
> child abnormal exit update did not update 1 row: key 9555 updated 0E0 at
> count.pl line 194.\n  at count.pl line 208.
> child abnormal exit update did not update 1 row: key 8870 updated 0E0 at
> count.pl line 194.\n  at count.pl line 208.
> child abnormal exit update did not update 1 row: key 8453 updated 0E0 at
> count.pl line 194.\n  at count.pl line 208.
>
> Those key values should always find exactly one row to update.
>
> If the tuples were permanently missing from the index, I would keep getting
> errors on the same key values very frequently.  But I don't get that, the
> errors remain infrequent and are on different value each time, so I think
> the tuples are in the index but the scan somehow misses them, either while
> the bucket is being split or while it is being squeezed.
>
> This on a build without enable-asserts.
>
> Any ideas on how best to go about investigating this?
>

I think these symptoms indicate a bug in the concurrent hash index
patch, but it could be that the problem can only be revealed with the
WAL patch.  Is it possible to try this with just the concurrent hash
index patch?  In any case, thanks for testing it; I will look into
these issues.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Sun, Sep 11, 2016 at 3:01 PM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:
> On 11/09/16 19:16, Mark Kirkwood wrote:
>
>>
>>
>> On 11/09/16 17:01, Amit Kapila wrote:
>>>
>>> ...Do you think we can do some read-only
>>> workload benchmarking using this server?  If yes, then probably you
>>> can use concurrent hash index patch [1] and cache the metapage patch
>>> [2] (I think Mithun needs to rebase his patch) to do so.
>>>
>>>
>>>
>>> [1] -
>>> https://www.postgresql.org/message-id/CAA4eK1J6b8O4PcEPqRxNYbLVbfToNMJEEm+qn0jZX31-obXrJw@mail.gmail.com
>>> [2] -
>>> https://www.postgresql.org/message-id/CAD__OuhJ29CeBif_fLGe4t9Vj_-cFXBwCXhjO+D_16TXbemY+g@mail.gmail.com
>>>
>>>
>>
>> I can do - are we checking checking for hangs/assertions or comparing
>> patched vs unpatched performance (for the metapage patch)?
>>
>>

The first step is to find any bugs and then do performance testing.
What I wanted for performance testing was to compare HEAD against all
three hash index patches applied together (concurrent hash index,
cache the meta page, WAL for hash index).  For HEAD, we want two sets
of numbers, one with hash indexes and another with btree indexes.  As
Jeff has found a few problems, I think it is better to fix those first
before going for performance tests.

>
> So, assuming the latter - testing performance with and without the metapage
> patch:
>
> For my 1st runs:
>
> - cpus 16, ran 16G
> - size 100, clients 32
>
> I'm seeing no difference in performance for read only (-S) pgbench workload
> (with everybody using has indexes). I guess not that surprising as the db
> fites in ram (1.6G and we have 16G). So I'll retry with a bigger dataset
> (suspect size 2000 is needed).
>

I think you should try with -S -M prepared and with various client
counts (8,16,32,64,100...).


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Write Ahead Logging for Hash Indexes

From
Jeff Janes
Date:
On Sun, Sep 11, 2016 at 7:40 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Sep 12, 2016 at 7:00 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Thu, Sep 8, 2016 at 12:09 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>
>>
>> I plan to do testing using my own testing harness after changing it to
>> insert a lot of dummy tuples (ones with negative values in the pseudo-pk
>> column, which are never queried by the core part of the harness) and
>> deleting them at random intervals.  I think that none of pgbench's built in
>> tests are likely to give the bucket splitting and squeezing code very much
>> exercise.
>
>
>
> I've implemented this, by adding lines 197 through 202 to the count.pl
> script.  (I'm reattaching the test case)
>
> Within a few minutes of testing, I start getting Errors like these:
>
> 29236 UPDATE XX000 2016-09-11 17:21:25.893 PDT:ERROR:  buffer 2762 is not
> owned by resource owner Portal
> 29236 UPDATE XX000 2016-09-11 17:21:25.893 PDT:STATEMENT:  update foo set
> count=count+1 where index=$1
>
>
> In one test, I also got an error from my test harness itself indicating
> tuples are transiently missing from the index, starting an hour into a test:
>
> child abnormal exit update did not update 1 row: key 9555 updated 0E0 at
> count.pl line 194.\n  at count.pl line 208.
> child abnormal exit update did not update 1 row: key 8870 updated 0E0 at
> count.pl line 194.\n  at count.pl line 208.
> child abnormal exit update did not update 1 row: key 8453 updated 0E0 at
> count.pl line 194.\n  at count.pl line 208.
>
> Those key values should always find exactly one row to update.
>
> If the tuples were permanently missing from the index, I would keep getting
> errors on the same key values very frequently.  But I don't get that, the
> errors remain infrequent and are on different value each time, so I think
> the tuples are in the index but the scan somehow misses them, either while
> the bucket is being split or while it is being squeezed.
>
> This on a build without enable-asserts.
>
> Any ideas on how best to go about investigating this?
>

I think these symptoms indicate the bug in concurrent hash index
patch, but it could be that the problem can be only revealed with WAL
patch.  Is it possible to just try this with concurrent hash index
patch?  In any case, thanks for testing it, I will look into these
issues.

My test program (as posted) injects crashes and then checks the post-crash-recovery system for consistency, so it cannot be run as-is without the WAL patch.  I also ran the test with crashing turned off (just change the JJ* variables at the top of do.sh to all be set to the empty string), and in that case I didn't see either problem, but it could just be that I didn't run it long enough.

It should have been long enough to detect the rather common "buffer <x> is not owned by resource owner Portal" problem, so that one I think is specific to the WAL patch (probably the part which tries to complete bucket splits when it detects one was started but not completed?)


Cheers,

Jeff

Re: Write Ahead Logging for Hash Indexes

From
Jesper Pedersen
Date:
Hi,

On 09/07/2016 05:58 AM, Amit Kapila wrote:
> Okay, I have fixed this issue as explained above.  Apart from that, I
> have fixed another issue reported by Mark Kirkwood upthread and few
> other issues found during internal testing by Ashutosh Sharma.
>
> The locking issue reported by Mark and Ashutosh is that the patch
> didn't maintain the locking order while adding overflow page as it
> maintains in other write operations (lock the bucket pages first and
> then metapage to perform the write operation).  I have added the
> comments in _hash_addovflpage() to explain the locking order used in
> modified patch.
>
> During stress testing with pgbench using master-standby setup, we
> found an issue which indicates that WAL replay machinery doesn't
> expect completely zeroed pages (See explanation of RBM_NORMAL mode
> atop XLogReadBufferExtended).  Previously before freeing the overflow
> page we were zeroing it, now I have changed it to just initialize the
> page such that the page will be empty.
>
> Apart from above, I have added support for old snapshot threshold in
> the hash index code.
>
> Thanks to Ashutosh Sharma for doing the testing of the patch and
> helping me in analyzing some of the above issues.
>
> I forgot to mention in my initial mail that Robert and I had some
> off-list discussions about the design of this patch, many thanks to
> him for providing inputs.
>

Some initial feedback.

README:
+in_complete split flag.  The reader algorithm works correctly, as it 
will scan

What flag ?

hashxlog.c:

hash_xlog_move_page_contents
hash_xlog_squeeze_page

Both have "bukcetbuf" (-> "bucketbuf"), and

+        if (BufferIsValid(bukcetbuf));

->

+        if (BufferIsValid(bucketbuf))

and indent following line:

LockBufferForCleanup(bukcetbuf);

hash_xlog_delete

has the "if" issue too.

hash.h:

Move the XLog related content to hash_xlog.h

Best regards, Jesper




Re: Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Mon, Sep 12, 2016 at 11:29 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>
>
> My test program (as posted) injects crashes and then checks the
> post-crash-recovery system for consistency, so it cannot be run as-is
> without the WAL patch.  I also ran the test with crashing turned off (just
> change the JJ* variables at the stop of the do.sh to all be set to the empty
> string), and in that case I didn't see either problem, but it it could just
> be that I that I didn't run it long enough.
>
> It should have been long enough to detect the rather common "buffer <x> is
> not owned by resource owner Portal" problem, so that one I think is specific
> to the WAL patch (probably the part which tries to complete bucket splits
> when it detects one was started but not completed?)
>

Both problems seem to be due to the same reason, and it lies in the
concurrent hash index patch.  What is happening here is that we are
trying to unpin the old bucket buffer twice.  We need to scan the old
bucket when there is an incomplete split; the issue is that the system
crashed during a split, and after restart, the scan of the old bucket
tries to unpin the old primary bucket buffer twice when the new bucket
has additional overflow pages.  I will post the updated patch on the
concurrent hash index thread.  Once I post the patch, it would be good
if you could try to reproduce the issue once more.

Thanks to Ashutosh Sharma, who shared the call stack off-list after
reproducing the problem.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Mon, Sep 12, 2016 at 10:24 PM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
>
> Some initial feedback.
>

Thanks for looking into the patch.

> README:
> +in_complete split flag.  The reader algorithm works correctly, as it will
> scan
>
> What flag ?
>

The in-complete-split flag indicates that the split still has to be
finished for that particular bucket.  The values of these flags are
LH_BUCKET_NEW_PAGE_SPLIT and LH_BUCKET_OLD_PAGE_SPLIT for the new and
old bucket respectively.  They are set in hasho_flag in the special
area of the page.  I have slightly expanded the definition in the
README, but I am not sure it is a good idea to mention flags defined
in hash.h there.  Let me know if it is still unclear or you want
something additional to be added to the README.
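
To make it a bit more concrete, a reader would look at the special
area of the bucket page roughly like this (an illustrative sketch
only, not the exact patch code):

    HashPageOpaque opaque = (HashPageOpaque) PageGetSpecialPointer(page);

    if (opaque->hasho_flag & LH_BUCKET_OLD_PAGE_SPLIT)
    {
        /*
         * The split of this bucket still has to be finished, so treat
         * it as an incomplete-split bucket.
         */
    }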

> hashxlog.c:
>
> hash_xlog_move_page_contents
> hash_xlog_squeeze_page
>
> Both have "bukcetbuf" (-> "bucketbuf"), and
>
> +               if (BufferIsValid(bukcetbuf));
>
> ->
>
> +               if (BufferIsValid(bucketbuf))
>
> and indent following line:
>
> LockBufferForCleanup(bukcetbuf);
>
> hash_xlog_delete
>
> has the "if" issue too.
>

Fixed all the above cosmetic issues.

> hash.h:
>
> Move the XLog related content to hash_xlog.h
>

Moved, and renamed hashxlog.c to hash_xlog.c to match the name of the
corresponding .h file.

Note - This patch is to be applied on top of the concurrent hash index
patch, v6 [1].  Jeff, please use this version of the patch for further
review and testing.

[1] - https://www.postgresql.org/message-id/CAA4eK1%2BERbP%2B7mdKkAhJZWQ_dTdkocbpt7LSWFwCQvUHBXzkmA%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Write Ahead Logging for Hash Indexes

From
Jesper Pedersen
Date:
On 09/13/2016 07:41 AM, Amit Kapila wrote:
>> README:
>> +in_complete split flag.  The reader algorithm works correctly, as it will
>> scan
>>
>> What flag ?
>>
>
> in-complete-split flag which indicates that split has to be finished
> for that particular bucket.  The value of these flags are
> LH_BUCKET_NEW_PAGE_SPLIT and LH_BUCKET_OLD_PAGE_SPLIT for new and old
> bucket respectively.  It is set in hasho_flag in special area of page.
> I have slightly expanded the definition in README, but not sure if it
> is good idea to mention flags defined in hash.h. Let me know, if still
> it is unclear or you want some thing additional to be added in README.
>

I think it is better now.

>> hashxlog.c:
>>
>> hash_xlog_move_page_contents
>> hash_xlog_squeeze_page
>>
>> Both have "bukcetbuf" (-> "bucketbuf"), and
>>
>> +               if (BufferIsValid(bukcetbuf));
>>
>> ->
>>
>> +               if (BufferIsValid(bucketbuf))
>>
>> and indent following line:
>>
>> LockBufferForCleanup(bukcetbuf);
>>
>> hash_xlog_delete
>>
>> has the "if" issue too.
>>
>
> Fixed all the above cosmetic issues.
>

I meant there is an extra ';' on the "if" statements:

+               if (BufferIsValid(bukcetbuf)); <--

in hash_xlog_move_page_contents, hash_xlog_squeeze_page and 
hash_xlog_delete.

Best regards, Jesper






Re: Write Ahead Logging for Hash Indexes

From
Ashutosh Sharma
Date:
Hi All,

I am getting the following errors when running the test script shared
by Jeff [1].  The errors are observed after executing the test script
for around 3-4 hrs.

57869 INSERT XX000 2016-09-14 07:58:01.211 IST:ERROR:  lock
buffer_content 1 is not held
57869 INSERT XX000 2016-09-14 07:58:01.211 IST:STATEMENT:  insert into
foo (index) select $1 from generate_series(1,10000)

124388 INSERT XX000 2016-09-14 11:24:13.593 IST:ERROR:  lock
buffer_content 10 is not held
124388 INSERT XX000 2016-09-14 11:24:13.593 IST:STATEMENT:  insert
into foo (index) select $1 from generate_series(1,10000)

124381 INSERT XX000 2016-09-14 11:24:13.594 IST:ERROR:  lock
buffer_content 10 is not held
124381 INSERT XX000 2016-09-14 11:24:13.594 IST:STATEMENT:  insert
into foo (index) select $1 from generate_series(1,10000)

[1]- https://www.postgresql.org/message-id/CAMkU%3D1xRt8jBBB7g_7K41W00%3Dbr9UrxMVn_rhWhKPLaHfEdM5A%40mail.gmail.com

Please note that I am performing the test on the latest patches for
concurrent hash index and WAL logging in hash index shared by Amit
yesterday.


With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com



On Wed, Sep 14, 2016 at 12:04 AM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
> On 09/13/2016 07:41 AM, Amit Kapila wrote:
>>>
>>> README:
>>> +in_complete split flag.  The reader algorithm works correctly, as it
>>> will
>>> scan
>>>
>>> What flag ?
>>>
>>
>> in-complete-split flag which indicates that split has to be finished
>> for that particular bucket.  The value of these flags are
>> LH_BUCKET_NEW_PAGE_SPLIT and LH_BUCKET_OLD_PAGE_SPLIT for new and old
>> bucket respectively.  It is set in hasho_flag in special area of page.
>> I have slightly expanded the definition in README, but not sure if it
>> is good idea to mention flags defined in hash.h. Let me know, if still
>> it is unclear or you want some thing additional to be added in README.
>>
>
> I think it is better now.
>
>>> hashxlog.c:
>>>
>>> hash_xlog_move_page_contents
>>> hash_xlog_squeeze_page
>>>
>>> Both have "bukcetbuf" (-> "bucketbuf"), and
>>>
>>> +               if (BufferIsValid(bukcetbuf));
>>>
>>> ->
>>>
>>> +               if (BufferIsValid(bucketbuf))
>>>
>>> and indent following line:
>>>
>>> LockBufferForCleanup(bukcetbuf);
>>>
>>> hash_xlog_delete
>>>
>>> has the "if" issue too.
>>>
>>
>> Fixed all the above cosmetic issues.
>>
>
> I meant there is an extra ';' on the "if" statements:
>
> +               if (BufferIsValid(bukcetbuf)); <--
>
> in hash_xlog_move_page_contents, hash_xlog_squeeze_page and
> hash_xlog_delete.
>
>
> Best regards,
>  Jesper
>



Re: Write Ahead Logging for Hash Indexes

From
Ashutosh Sharma
Date:
Hi All,

Below is the backtrace for the issue reported in my earlier mail [1].
From the call stack it looks like we are trying to release the lock on
a meta page twice in _hash_expandtable().

(gdb) bt
#0  0x00000000007b01cf in LWLockRelease (lock=0x7f55f59d0570) at lwlock.c:1799
#1  0x000000000078990c in LockBuffer (buffer=2, mode=0) at bufmgr.c:3540
#2  0x00000000004a5d3c in _hash_chgbufaccess (rel=0x7f5605b608c0,
buf=2, from_access=1, to_access=-1) at hashpage.c:331
#3  0x00000000004a722d in _hash_expandtable (rel=0x7f5605b608c0,
metabuf=2) at hashpage.c:995
#4  0x00000000004a316a in _hash_doinsert (rel=0x7f5605b608c0,
itup=0x2ba5030) at hashinsert.c:313
#5  0x00000000004a0d85 in hashinsert (rel=0x7f5605b608c0,
values=0x7ffdf5409c70, isnull=0x7ffdf5409c50 "", ht_ctid=0x2c2e4f4,
heapRel=0x7f5605b58250,   checkUnique=UNIQUE_CHECK_NO) at hash.c:248
#6  0x00000000004c5a16 in index_insert (indexRelation=0x7f5605b608c0,
values=0x7ffdf5409c70, isnull=0x7ffdf5409c50 "",
heap_t_ctid=0x2c2e4f4,   heapRelation=0x7f5605b58250, checkUnique=UNIQUE_CHECK_NO) at indexam.c:204
#7  0x000000000063f2d5 in ExecInsertIndexTuples (slot=0x2b9a2c0,
tupleid=0x2c2e4f4, estate=0x2b99c00, noDupErr=0 '\000',
specConflict=0x0,   arbiterIndexes=0x0) at execIndexing.c:388
#8  0x000000000066a1fc in ExecInsert (mtstate=0x2b99e50,
slot=0x2b9a2c0, planSlot=0x2b9a2c0, arbiterIndexes=0x0,
onconflict=ONCONFLICT_NONE,   estate=0x2b99c00, canSetTag=1 '\001') at nodeModifyTable.c:481
#9  0x000000000066b841 in ExecModifyTable (node=0x2b99e50) at
nodeModifyTable.c:1496
#10 0x0000000000645e51 in ExecProcNode (node=0x2b99e50) at execProcnode.c:396
#11 0x00000000006424d9 in ExecutePlan (estate=0x2b99c00,
planstate=0x2b99e50, use_parallel_mode=0 '\000', operation=CMD_INSERT,
sendTuples=0 '\000',   numberTuples=0, direction=ForwardScanDirection, dest=0xd73c20
<donothingDR>) at execMain.c:1567
#12 0x000000000064087a in standard_ExecutorRun (queryDesc=0x2b89c70,
direction=ForwardScanDirection, count=0) at execMain.c:338
#13 0x00007f55fe590605 in pgss_ExecutorRun (queryDesc=0x2b89c70,
direction=ForwardScanDirection, count=0) at pg_stat_statements.c:877
#14 0x0000000000640751 in ExecutorRun (queryDesc=0x2b89c70,
direction=ForwardScanDirection, count=0) at execMain.c:284
#15 0x00000000007c331e in ProcessQuery (plan=0x2b873f0,
sourceText=0x2b89bd0 "insert into foo (index) select $1 from
generate_series(1,10000)",   params=0x2b89c20, dest=0xd73c20 <donothingDR>,
completionTag=0x7ffdf540a3f0 "") at pquery.c:187
#16 0x00000000007c4a0c in PortalRunMulti (portal=0x2ae7930,
isTopLevel=1 '\001', setHoldSnapshot=0 '\000', dest=0xd73c20
<donothingDR>,   altdest=0xd73c20 <donothingDR>, completionTag=0x7ffdf540a3f0 "")
at pquery.c:1303
#17 0x00000000007c4055 in PortalRun (portal=0x2ae7930,
count=9223372036854775807, isTopLevel=1 '\001', dest=0x2b5dc90,
altdest=0x2b5dc90,   completionTag=0x7ffdf540a3f0 "") at pquery.c:815
#18 0x00000000007bfb45 in exec_execute_message (portal_name=0x2b5d880
"", max_rows=9223372036854775807) at postgres.c:1977
#19 0x00000000007c25c7 in PostgresMain (argc=1, argv=0x2b05df8,
dbname=0x2aca3c0 "postgres", username=0x2b05de0 "edb") at
postgres.c:4133
#20 0x0000000000744d8f in BackendRun (port=0x2aefa60) at postmaster.c:4260
#21 0x0000000000744523 in BackendStartup (port=0x2aefa60) at postmaster.c:3934
#22 0x0000000000740f9c in ServerLoop () at postmaster.c:1691
#23 0x0000000000740623 in PostmasterMain (argc=4, argv=0x2ac81f0) at
postmaster.c:1299
#24 0x00000000006940fe in main (argc=4, argv=0x2ac81f0) at main.c:228

Please let me know if you need any further inputs.

[1]- https://www.postgresql.org/message-id/CAE9k0Pmxh-4NAr4GjzDDFHdBKDrKy2FV-Z%2B2Tp8vb2Kmxu%3D6zg%40mail.gmail.com

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Wed, Sep 14, 2016 at 2:45 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> Hi All,
>
> I am getting following error when running the test script shared by
> Jeff -[1] . The error is observed upon executing the test script for
> around 3-4 hrs.
>
> 57869 INSERT XX000 2016-09-14 07:58:01.211 IST:ERROR:  lock
> buffer_content 1 is not held
> 57869 INSERT XX000 2016-09-14 07:58:01.211 IST:STATEMENT:  insert into
> foo (index) select $1 from generate_series(1,10000)
>
> 124388 INSERT XX000 2016-09-14 11:24:13.593 IST:ERROR:  lock
> buffer_content 10 is not held
> 124388 INSERT XX000 2016-09-14 11:24:13.593 IST:STATEMENT:  insert
> into foo (index) select $1 from generate_series(1,10000)
>
> 124381 INSERT XX000 2016-09-14 11:24:13.594 IST:ERROR:  lock
> buffer_content 10 is not held
> 124381 INSERT XX000 2016-09-14 11:24:13.594 IST:STATEMENT:  insert
> into foo (index) select $1 from generate_series(1,10000)
>
> [1]- https://www.postgresql.org/message-id/CAMkU%3D1xRt8jBBB7g_7K41W00%3Dbr9UrxMVn_rhWhKPLaHfEdM5A%40mail.gmail.com
>
> Please note that i am performing the test on latest patch for
> concurrent hash index and WAL log in hash index shared by Amit
> yesterday.
>
>
> With Regards,
> Ashutosh Sharma
> EnterpriseDB: http://www.enterprisedb.com
>
>
>
> On Wed, Sep 14, 2016 at 12:04 AM, Jesper Pedersen
> <jesper.pedersen@redhat.com> wrote:
>> On 09/13/2016 07:41 AM, Amit Kapila wrote:
>>>>
>>>> README:
>>>> +in_complete split flag.  The reader algorithm works correctly, as it
>>>> will
>>>> scan
>>>>
>>>> What flag ?
>>>>
>>>
>>> in-complete-split flag which indicates that split has to be finished
>>> for that particular bucket.  The value of these flags are
>>> LH_BUCKET_NEW_PAGE_SPLIT and LH_BUCKET_OLD_PAGE_SPLIT for new and old
>>> bucket respectively.  It is set in hasho_flag in special area of page.
>>> I have slightly expanded the definition in README, but not sure if it
>>> is good idea to mention flags defined in hash.h. Let me know, if still
>>> it is unclear or you want some thing additional to be added in README.
>>>
>>
>> I think it is better now.
>>
>>>> hashxlog.c:
>>>>
>>>> hash_xlog_move_page_contents
>>>> hash_xlog_squeeze_page
>>>>
>>>> Both have "bukcetbuf" (-> "bucketbuf"), and
>>>>
>>>> +               if (BufferIsValid(bukcetbuf));
>>>>
>>>> ->
>>>>
>>>> +               if (BufferIsValid(bucketbuf))
>>>>
>>>> and indent following line:
>>>>
>>>> LockBufferForCleanup(bukcetbuf);
>>>>
>>>> hash_xlog_delete
>>>>
>>>> has the "if" issue too.
>>>>
>>>
>>> Fixed all the above cosmetic issues.
>>>
>>
>> I meant there is an extra ';' on the "if" statements:
>>
>> +               if (BufferIsValid(bukcetbuf)); <--
>>
>> in hash_xlog_move_page_contents, hash_xlog_squeeze_page and
>> hash_xlog_delete.
>>
>>
>> Best regards,
>>  Jesper
>>



Re: Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Wed, Sep 14, 2016 at 4:36 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> Hi All,
>
> Below is the backtrace for the issue reported in my earlier mail [1].
> From the callstack it looks like we are trying to release  lock on a
> meta page twice in _hash_expandtable().
>

Thanks for the call stack.  I think the below code in the patch is the
culprit.  Here we have already released the meta page lock, and then
on failure we try to release it again.

_hash_expandtable()
{
..

/* Release the metapage lock, before completing the split. */
_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
..

if (!buf_nblkno)
{
_hash_relbuf(rel, buf_oblkno);
goto fail;
}
..
fail:
/* We didn't write the metapage, so just drop lock */
_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
}

This is a problem in the concurrent hash index patch.  I will send the
fix in the next version of the patch.
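
For reference, a minimal sketch of one way to avoid the double release
(not necessarily what the actual fix will look like) is to remember
whether the metapage lock is still held before jumping to the fail
path:

    bool    metap_locked = true;
    ...
    /* Release the metapage lock, before completing the split. */
    _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
    metap_locked = false;
    ...
fail:
    /* drop the metapage lock only if we still hold it */
    if (metap_locked)
        _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);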


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Wed, Sep 14, 2016 at 12:04 AM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
> On 09/13/2016 07:41 AM, Amit Kapila wrote:
>>>

>
> I meant there is an extra ';' on the "if" statements:
>
> +               if (BufferIsValid(bukcetbuf)); <--
>
> in hash_xlog_move_page_contents, hash_xlog_squeeze_page and
> hash_xlog_delete.
>

Okay, thanks for pointing that out.  I have fixed it.  Apart from
that, I have changed _hash_alloc_buckets() to initialize the page
instead of making it completely zero, because of problems discussed in
another related thread [1].  I have also updated the README.
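
Roughly, the _hash_alloc_buckets() change amounts to something like
this (a sketch only, not the exact diff):

    /* write an initialized, empty page instead of an all-zeroes page */
    char    zerobuf[BLCKSZ];

    PageInit((Page) zerobuf, BLCKSZ, sizeof(HashPageOpaqueData));
    /* ... then write zerobuf out to extend the relation as before ... */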



[1] - https://www.postgresql.org/message-id/CAA4eK1%2BzT16bxVpF3J9UsiZxExCqFa6QxzUfazmJthW6dO78ww%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Write Ahead Logging for Hash Indexes

From
Jeff Janes
Date:
On Thu, Sep 15, 2016 at 11:42 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Okay, Thanks for pointing out the same.  I have fixed it.  Apart from
that, I have changed _hash_alloc_buckets() to initialize the page
instead of making it completely Zero because of problems discussed in
another related thread [1].  I have also updated README.


With v7 of the concurrent hash patch and v4 of the write ahead log patch and the latest relcache patch (I don't know how important that is to reproducing this; I suspect it is not), I once got this error:


38422  00000 2016-09-19 16:25:50.055 PDT:LOG:  database system was interrupted; last known up at 2016-09-19 16:25:49 PDT
38422  00000 2016-09-19 16:25:50.057 PDT:LOG:  database system was not properly shut down; automatic recovery in progress
38422  00000 2016-09-19 16:25:50.057 PDT:LOG:  redo starts at 3F/2200DE90
38422  01000 2016-09-19 16:25:50.061 PDT:WARNING:  page verification failed, calculated checksum 65067 but expected 21260
38422  01000 2016-09-19 16:25:50.061 PDT:CONTEXT:  xlog redo at 3F/22053B50 for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
38422  XX001 2016-09-19 16:25:50.071 PDT:FATAL:  invalid page in block 9 of relation base/16384/17334
38422  XX001 2016-09-19 16:25:50.071 PDT:CONTEXT:  xlog redo at 3F/22053B50 for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T


The original page with the invalid checksum is:

$ od 16384_17334_9
0000000 000032 000000 015420 077347 020404 000000 000030 017760
0000020 017760 020004 000000 000000 000000 000000 000000 000000
0000040 000000 000000 000000 000000 000000 000000 000000 000000
*
0017760 000007 000002 177777 177777 000007 000000 000002 177600
0020000

If I ignore the checksum failure and re-start the system, the page gets restored to be a bitmap page.

Cheers,

Jeff

Re: Write Ahead Logging for Hash Indexes

From
Jesper Pedersen
Date:
On 09/16/2016 02:42 AM, Amit Kapila wrote:
> Okay, Thanks for pointing out the same.  I have fixed it.  Apart from
> that, I have changed _hash_alloc_buckets() to initialize the page
> instead of making it completely Zero because of problems discussed in
> another related thread [1].  I have also updated README.
>

Thanks.

This needs a rebase against the CHI v8 [1] patch.

[1] 
https://www.postgresql.org/message-id/CAA4eK1+X=8sUd1UCZDZnE3D9CGi9kw+kjxp2Tnw7SX5w8pLBNw@mail.gmail.com

Best regards, Jesper




Re: Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Tue, Sep 20, 2016 at 10:24 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Thu, Sep 15, 2016 at 11:42 PM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
>>
>>
>> Okay, Thanks for pointing out the same.  I have fixed it.  Apart from
>> that, I have changed _hash_alloc_buckets() to initialize the page
>> instead of making it completely Zero because of problems discussed in
>> another related thread [1].  I have also updated README.
>>
>
> with v7 of the concurrent has patch and v4 of the write ahead log patch and
> the latest relcache patch (I don't know how important that is to reproducing
> this, I suspect it is not), I once got this error:
>
>
> 38422  00000 2016-09-19 16:25:50.055 PDT:LOG:  database system was
> interrupted; last known up at 2016-09-19 16:25:49 PDT
> 38422  00000 2016-09-19 16:25:50.057 PDT:LOG:  database system was not
> properly shut down; automatic recovery in progress
> 38422  00000 2016-09-19 16:25:50.057 PDT:LOG:  redo starts at 3F/2200DE90
> 38422  01000 2016-09-19 16:25:50.061 PDT:WARNING:  page verification failed,
> calculated checksum 65067 but expected 21260
> 38422  01000 2016-09-19 16:25:50.061 PDT:CONTEXT:  xlog redo at 3F/22053B50
> for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
> 38422  XX001 2016-09-19 16:25:50.071 PDT:FATAL:  invalid page in block 9 of
> relation base/16384/17334
> 38422  XX001 2016-09-19 16:25:50.071 PDT:CONTEXT:  xlog redo at 3F/22053B50
> for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
>
>
> The original page with the invalid checksum is:
>

I think this is an example of the torn page problem, which seems to be
happening because of the below code in your test.

!         if (JJ_torn_page > 0 && counter++ > JJ_torn_page &&
!RecoveryInProgress()) {
!   nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ/3);
! ereport(FATAL,
! (errcode(ERRCODE_DISK_FULL),
!  errmsg("could not write block %u of relation %s: wrote only %d of %d bytes",
! blocknum,
! relpath(reln->smgr_rnode, forknum),
! nbytes, BLCKSZ),
!  errhint("JJ is screwing with the database.")));
!         } else {
!   nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
! }

If you are running the above test with JJ_torn_page disabled, then it
is a different matter and we need to investigate it, but I assume you
are running with it enabled.

I think this could happen if the actual change to the page is in the
2/3 of the page which you are not writing in the above code.  The
checksum in the page header, which is written as part of the partial
page write (the first 1/3 of the page), would have accounted for the
change you made, whereas after restart, when the page is read again to
apply redo, the checksum calculation won't include the change made in
the remaining 2/3 of the page.

Today, Ashutosh shared the logs of a test run where he has shown a
similar problem for a heap page.  I think this could happen, though
rarely, for any page with the above kind of tests.

Does this explanation explain the problem you are seeing?

>
> If I ignore the checksum failure and re-start the system, the page gets
> restored to be a bitmap page.
>

Okay, but have you ensured in some way that redo is applied to bitmap page?


Today, while thinking about this problem, I realized that currently in
the patch we are using REGBUF_NO_IMAGE for the bitmap page because of
one of the problems reported by you [1].  That change fixes the problem
you reported, but it exposes bitmap pages to torn-page hazards.  I
think the right fix there is to make pd_lower equal to pd_upper for the
bitmap page, so that full page writes don't exclude the data in the
bitmap page.
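
In code terms, that would look roughly like the following when the
bitmap page is set up (a sketch only; bitmapbuf is just an
illustrative name).  With the standard page layout, the area between
pd_lower and pd_upper is treated as a hole and omitted from the full
page image, so leaving no hole means the bitmap data is always
included:

    PageHeader  p = (PageHeader) BufferGetPage(bitmapbuf);

    /* claim everything up to pd_upper as used, so FPIs copy the bitmap */
    p->pd_lower = p->pd_upper;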

Thoughts?

[1] - https://www.postgresql.org/message-id/CAA4eK1KJOfVvFUmi6dcX9Y2-0PFHkomDzGuyoC%3DaD3Qj9WPpFA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Write Ahead Logging for Hash Indexes

From
Jeff Janes
Date:
On Tue, Sep 20, 2016 at 10:27 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Sep 20, 2016 at 10:24 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Thu, Sep 15, 2016 at 11:42 PM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
>>
>>
>> Okay, Thanks for pointing out the same.  I have fixed it.  Apart from
>> that, I have changed _hash_alloc_buckets() to initialize the page
>> instead of making it completely Zero because of problems discussed in
>> another related thread [1].  I have also updated README.
>>
>
> with v7 of the concurrent has patch and v4 of the write ahead log patch and
> the latest relcache patch (I don't know how important that is to reproducing
> this, I suspect it is not), I once got this error:
>
>
> 38422  00000 2016-09-19 16:25:50.055 PDT:LOG:  database system was
> interrupted; last known up at 2016-09-19 16:25:49 PDT
> 38422  00000 2016-09-19 16:25:50.057 PDT:LOG:  database system was not
> properly shut down; automatic recovery in progress
> 38422  00000 2016-09-19 16:25:50.057 PDT:LOG:  redo starts at 3F/2200DE90
> 38422  01000 2016-09-19 16:25:50.061 PDT:WARNING:  page verification failed,
> calculated checksum 65067 but expected 21260
> 38422  01000 2016-09-19 16:25:50.061 PDT:CONTEXT:  xlog redo at 3F/22053B50
> for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
> 38422  XX001 2016-09-19 16:25:50.071 PDT:FATAL:  invalid page in block 9 of
> relation base/16384/17334
> 38422  XX001 2016-09-19 16:25:50.071 PDT:CONTEXT:  xlog redo at 3F/22053B50
> for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
>
>
> The original page with the invalid checksum is:
>

I think this is a example of torn page problem, which seems to be
happening because of the below code in your test.

!         if (JJ_torn_page > 0 && counter++ > JJ_torn_page &&
!RecoveryInProgress()) {
!   nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ/3);
! ereport(FATAL,
! (errcode(ERRCODE_DISK_FULL),
!  errmsg("could not write block %u of relation %s: wrote only %d of %d bytes",
! blocknum,
! relpath(reln->smgr_rnode, forknum),
! nbytes, BLCKSZ),
!  errhint("JJ is screwing with the database.")));
!         } else {
!   nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
! }

If you are running the above test by disabling JJ_torn_page, then it
is a different matter and we need to investigate it, but l assume you
are running by enabling it.

I think this could happen if the actual change in page is in 2/3 part
of page which you are not writing in above code.  The checksum in page
header which is written as part of partial page write (1/3 part of
page) would have considered the actual change you have made whereas
after restart when it again read the page to apply redo, the checksum
calculation won't include the change being made in 2/3 part.

Correct.  But any torn page write must be covered by the restoration of a full page image during replay, shouldn't it?  And that restoration should happen blindly, without first reading in the old page and verifying the checksum.  Failure to restore the page from a FPI would be a bug.  (That was the purpose for which I wrote this testing harness in the first place, to verify that the restoration of FPI happens correctly; although most of the bugs it happens to uncover have been unrelated to that.)

 

Today, Ashutosh has shared the logs of his test run where he has shown
similar problem for HEAP page.  I think this could happen though
rarely for any page with the above kind of tests.

I think Ashutosh's examples are of warnings, not errors.  I think the warnings occur when replay needs to read in the block (for reasons I don't understand yet) but then doesn't care whether it passes the checksum or not, because it will just be blown away by the replay anyway.


Does this explanation explains the reason of problem you are seeing?

If it can't survive artificial torn page writes, then it probably can't survive real ones either.  So I am pretty sure it is a bug of some sort.  Perhaps the bug is that it generates an ERROR when it should just be a WARNING?
 

>
> If I ignore the checksum failure and re-start the system, the page gets
> restored to be a bitmap page.
>

Okay, but have you ensured in some way that redo is applied to bitmap page?


I haven't done that yet.  I can't start the system without destroying the evidence, and I haven't figured out yet how to import a specific block from a shut-down server into a bytea of a running server, in order to inspect it using pageinspect.

Today, while thinking on this problem, I realized that currently in
patch we are using REGBUF_NO_IMAGE for bitmap page for one of the
problem reported by you [1].  That change will fix the problem
reported by you, but it will expose bitmap pages for torn-page
hazards.  I think the right fix there is to make pd_lower equal to
pd_upper for bitmap page, so that full page writes doesn't exclude the
data in bitmappage.

I'm afraid that is over my head.  I can study it until it makes sense, but it will take me a while.

Cheers,

Jeff


[1] - https://www.postgresql.org/message-id/CAA4eK1KJOfVvFUmi6dcX9Y2-0PFHkomDzGuyoC%3DaD3Qj9WPpFA%40mail.gmail.com

Re: Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Thu, Sep 22, 2016 at 8:51 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Tue, Sep 20, 2016 at 10:27 PM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
>>
>> On Tue, Sep 20, 2016 at 10:24 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> > On Thu, Sep 15, 2016 at 11:42 PM, Amit Kapila <amit.kapila16@gmail.com>
>> > wrote:
>> >>
>> >>
>> >> Okay, Thanks for pointing out the same.  I have fixed it.  Apart from
>> >> that, I have changed _hash_alloc_buckets() to initialize the page
>> >> instead of making it completely Zero because of problems discussed in
>> >> another related thread [1].  I have also updated README.
>> >>
>> >
>> > with v7 of the concurrent has patch and v4 of the write ahead log patch
>> > and
>> > the latest relcache patch (I don't know how important that is to
>> > reproducing
>> > this, I suspect it is not), I once got this error:
>> >
>> >
>> > 38422  00000 2016-09-19 16:25:50.055 PDT:LOG:  database system was
>> > interrupted; last known up at 2016-09-19 16:25:49 PDT
>> > 38422  00000 2016-09-19 16:25:50.057 PDT:LOG:  database system was not
>> > properly shut down; automatic recovery in progress
>> > 38422  00000 2016-09-19 16:25:50.057 PDT:LOG:  redo starts at
>> > 3F/2200DE90
>> > 38422  01000 2016-09-19 16:25:50.061 PDT:WARNING:  page verification
>> > failed,
>> > calculated checksum 65067 but expected 21260
>> > 38422  01000 2016-09-19 16:25:50.061 PDT:CONTEXT:  xlog redo at
>> > 3F/22053B50
>> > for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
>> > 38422  XX001 2016-09-19 16:25:50.071 PDT:FATAL:  invalid page in block 9
>> > of
>> > relation base/16384/17334
>> > 38422  XX001 2016-09-19 16:25:50.071 PDT:CONTEXT:  xlog redo at
>> > 3F/22053B50
>> > for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
>> >
>> >
>> > The original page with the invalid checksum is:
>> >
>>
>> I think this is a example of torn page problem, which seems to be
>> happening because of the below code in your test.
>>
>> !         if (JJ_torn_page > 0 && counter++ > JJ_torn_page &&
>> !RecoveryInProgress()) {
>> !   nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ/3);
>> ! ereport(FATAL,
>> ! (errcode(ERRCODE_DISK_FULL),
>> !  errmsg("could not write block %u of relation %s: wrote only %d of %d
>> bytes",
>> ! blocknum,
>> ! relpath(reln->smgr_rnode, forknum),
>> ! nbytes, BLCKSZ),
>> !  errhint("JJ is screwing with the database.")));
>> !         } else {
>> !   nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
>> ! }
>>
>> If you are running the above test by disabling JJ_torn_page, then it
>> is a different matter and we need to investigate it, but l assume you
>> are running by enabling it.
>>
>> I think this could happen if the actual change in page is in 2/3 part
>> of page which you are not writing in above code.  The checksum in page
>> header which is written as part of partial page write (1/3 part of
>> page) would have considered the actual change you have made whereas
>> after restart when it again read the page to apply redo, the checksum
>> calculation won't include the change being made in 2/3 part.
>
>
> Correct.  But any torn page write must be covered by the restoration of a
> full page image during replay, shouldn't it?  And that restoration should
> happen blindly, without first reading in the old page and verifying the
> checksum.
>

Probably, but I think this is not currently the way it is handled, and
I don't want to change it.  AFAIU, what happens now is that we first
read the old page (and we do page verification while reading the old
page and display a *warning* if the checksum doesn't match) and then
restore the image.  The relevant code path is
XLogReadBufferForRedo()->XLogReadBufferExtended()->ReadBufferWithoutRelcache()->ReadBuffer_common()->PageIsVerified().

Now, the point here is why we are seeing a FATAL instead of a WARNING
for the bitmap page of the hash index.  The reason, as explained in my
previous e-mail, is that for the bitmap page we are not logging a full
page image, and we should fix that as explained there.  Once the full
page image is logged, then the first time it reads the torn page it
will use the flag RBM_ZERO_AND_LOCK, which will turn the FATAL error
you are seeing into a WARNING.

I think this is the reason why Ashutosh is seeing a WARNING for a heap
page whereas you are seeing a FATAL for a bitmap page of the hash index.

I don't think we should try to avoid the WARNING for torn pages as
part of this patch, even if that is the better course of action, but
certainly the FATAL should be fixed and a WARNING displayed instead,
as is done for heap pages.

>
>> Today, while thinking on this problem, I realized that currently in
>> patch we are using REGBUF_NO_IMAGE for bitmap page for one of the
>> problem reported by you [1].  That change will fix the problem
>> reported by you, but it will expose bitmap pages for torn-page
>> hazards.  I think the right fix there is to make pd_lower equal to
>> pd_upper for bitmap page, so that full page writes doesn't exclude the
>> data in bitmappage.
>
>
> I'm afraid that is over my head.  I can study it until it makes sense, but
> it will take me a while.
>

Does the explanation above help?  If not, tell me and I will try to
explain it once more.



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Thu, Sep 22, 2016 at 10:16 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Sep 22, 2016 at 8:51 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>
>>
>> Correct.  But any torn page write must be covered by the restoration of a
>> full page image during replay, shouldn't it?  And that restoration should
>> happen blindly, without first reading in the old page and verifying the
>> checksum.
>>
>
> Probably, but I think this is not currently the way it is handled and
> I don't want to change it.  AFAIU, what happens now is that we first
> read the old page (and we do page verification while reading old page
> and display *warning* if checksum doesn't match) and then restore the
> image.  The relevant code path is
>
XLogReadBufferForRedo()->XLogReadBufferExtended()->ReadBufferWithoutRelcache()->ReadBuffer_common()->PageIsVerified().
>
> Now, here the point is that why instead of WARNING we are seeing FATAL
> for the bitmap page of hash index.  The reason as explained in my
> previous e-mail is that for bitmap page we are not logging full page
> image and we should fix that as explained there.  Once the full page
> image is logged, then first time it reads the torn page, it will use
> flag RBM_ZERO_AND_LOCK which will make the FATAL error you are seeing
> to WARNING.
>

I think here I am slightly wrong.  For full page writes, it does use
RBM_ZERO_AND_LOCK mode to read the page, and in that mode we do not do
the page verification check but rather blindly set the page to zero
and then overwrite it with the full page image.  So after my fix, you
will not see the checksum failure error.  I have a fix ready, but am
still doing some more verification.  If everything passes, I will
share the patch in a day or so.
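
In other words, the relevant branch on the redo side looks roughly
like this (a simplified sketch of XLogReadBufferForRedoExtended(), not
the exact code):

    if (XLogRecHasBlockImage(record, block_id))
    {
        /* zero and lock the buffer without reading (or verifying) the old page */
        *buf = XLogReadBufferExtended(rnode, forknum, blkno, RBM_ZERO_AND_LOCK);
        page = BufferGetPage(*buf);
        if (!RestoreBlockImage(record, block_id, page))
            elog(ERROR, "failed to restore block image");
        PageSetLSN(page, lsn);          /* lsn is the record's end LSN */
        MarkBufferDirty(*buf);
        return BLK_RESTORED;
    }
    else
    {
        /*
         * No image attached: read the existing page with RBM_NORMAL, which
         * is where PageIsVerified() can complain about a torn write.
         */
        *buf = XLogReadBufferExtended(rnode, forknum, blkno, RBM_NORMAL);
        ...
    }

So once the bitmap page carries a full page image, the torn page is
never read and verified in the first place.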

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Fri, Sep 23, 2016 at 5:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I think here I am slightly wrong.  For the full page writes, it do use
> RBM_ZERO_AND_LOCK mode to read the page and for such mode we are not
> doing page verification check and rather blindly setting the page to
> zero and then overwrites it with full page image.  So after my fix,
> you will not see the error of checksum failure.  I have a fix ready,
> but still doing some more verification.  If everything passes, I will
> share the patch in a day or so.
>

The attached patch fixes the problem; now we do perform full page
writes for bitmap pages.  Apart from that, I have rebased the patch on
top of the latest concurrent index patch [1].  I have updated the
README as well to reflect the WAL-logging-related information for the
different operations.

With the attached patch, all the review comments and issues found so
far are addressed.


[1] - https://www.postgresql.org/message-id/CAA4eK1%2BX%3D8sUd1UCZDZnE3D9CGi9kw%2Bkjxp2Tnw7SX5w8pLBNw%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Sun, Sep 25, 2016 at 10:30 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Sep 23, 2016 at 5:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> I think here I am slightly wrong.  For the full page writes, it do use
>> RBM_ZERO_AND_LOCK mode to read the page and for such mode we are not
>> doing page verification check and rather blindly setting the page to
>> zero and then overwrites it with full page image.  So after my fix,
>> you will not see the error of checksum failure.  I have a fix ready,
>> but still doing some more verification.  If everything passes, I will
>> share the patch in a day or so.
>>
>
> Attached patch fixes the problem, now we do perform full page writes
> for bitmap pages.  Apart from that, I have rebased the patch based on
> latest concurrent index patch [1].  I have updated the README as well
> to reflect the WAL logging related information for different
> operations.
>
> With attached patch, all the review comments or issues found till now
> are addressed.
>

I forgot to mention that Ashutosh has tested this patch for a day
using Jeff's tool and he didn't find any problem.  Also, he has found
a way to easily reproduce the problem.  Ashutosh, can you share your
changes to the script with which you reproduced the problem?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Write Ahead Logging for Hash Indexes

From
Ashutosh Sharma
Date:
Hi All,

> I forgot to mention that Ashutosh has tested this patch for a day
> using Jeff's tool and he didn't found any problem.  Also, he has found
> a way to easily reproduce the problem.  Ashutosh, can you share your
> changes to the script using which you have reproduce the problem?

I made slight changes in Jeff Janes' patch (crash_REL10.patch) to
reproduce the issue he reported earlier [1]. I changed
crash_REL10.patch such that a torn page write happens only for a
bitmap page in a hash index. As described by Amit in [2] & [3], we
were not logging a full page image for the bitmap page, and that is
why we were not able to restore the bitmap page when it was torn,
which would eventually result in the error below.

38422  01000 2016-09-19 16:25:50.061 PDT:WARNING:  page verification
failed, calculated checksum 65067 but expected 21260
38422  01000 2016-09-19 16:25:50.061 PDT:CONTEXT:  xlog redo at
3F/22053B50 for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
38422  XX001 2016-09-19 16:25:50.071 PDT:FATAL:  invalid page in block
9 of relation base/16384/17334

Jeff Janes' original patch had the following condition for a torn page write:

if (JJ_torn_page > 0 && counter++ > JJ_torn_page && !RecoveryInProgress())
    nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ/3);

I have added some more conditions to the above if () to ensure that
the issue associated with the bitmap page gets reproduced easily:

if (JJ_torn_page > 0 && counter++ > JJ_torn_page && !RecoveryInProgress() &&
    bucket_opaque->hasho_page_id == HASHO_PAGE_ID &&
    bucket_opaque->hasho_flag & LH_BITMAP_PAGE)
    nbytes = FileWrite(v->mdfd_vfd, buffer, 24);

Also, please note that I am allowing only 24 bytes of data, i.e. just
the page header, to be written to the disk file, so that the issue
will be observed even if a single overflow page is added to the hash
index.

Note: You can find Jeff Janes' torn page write patch at [4].

[1] - https://www.postgresql.org/message-id/CAMkU%3D1zGcfNTWWxnud8j_p0vb1ENGcngkwqZgPUEnwZmAN%2BXQw%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAA4eK1LmQZGnYhSHXDDCOsSb_0U-gsxReEmSDRgCZr%3DAdKbTEg%40mail.gmail.com
[3] - https://www.postgresql.org/message-id/CAA4eK1%2Bk9tGPw7vOACRo%2B4h5n%3DutHnSuGgMoJrLANybAjNBW9w%40mail.gmail.com
[4] - https://www.postgresql.org/message-id/CAMkU%3D1xRt8jBBB7g_7K41W00%3Dbr9UrxMVn_rhWhKPLaHfEdM5A%40mail.gmail.com

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com



Re: Write Ahead Logging for Hash Indexes

From
Jesper Pedersen
Date:
On 09/25/2016 01:00 AM, Amit Kapila wrote:
> Attached patch fixes the problem, now we do perform full page writes
> for bitmap pages.  Apart from that, I have rebased the patch based on
> latest concurrent index patch [1].  I have updated the README as well
> to reflect the WAL logging related information for different
> operations.
>
> With attached patch, all the review comments or issues found till now
> are addressed.
>

I have been running various tests and applications with this patch
together with the CHI v8 patch [1].

As I haven't seen any failures and don't currently have additional
feedback, I'm moving this patch to "Ready for Committer" for their feedback.

If others have comments, please move the patch status back in the
CommitFest application.

[1] 
https://www.postgresql.org/message-id/CAA4eK1%2BX%3D8sUd1UCZDZnE3D9CGi9kw%2Bkjxp2Tnw7SX5w8pLBNw%40mail.gmail.com

Best regards, Jesper




Re: Write Ahead Logging for Hash Indexes

From
Jeff Janes
Date:
On Sat, Sep 24, 2016 at 10:00 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Sep 23, 2016 at 5:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I think here I am slightly wrong.  For the full page writes, it do use
> RBM_ZERO_AND_LOCK mode to read the page and for such mode we are not
> doing page verification check and rather blindly setting the page to
> zero and then overwrites it with full page image.  So after my fix,
> you will not see the error of checksum failure.  I have a fix ready,
> but still doing some more verification.  If everything passes, I will
> share the patch in a day or so.
>

Attached patch fixes the problem, now we do perform full page writes
for bitmap pages.  Apart from that, I have rebased the patch based on
latest concurrent index patch [1].  I have updated the README as well
to reflect the WAL logging related information for different
operations.

With attached patch, all the review comments or issues found till now
are addressed.

This needs to be updated to apply over concurrent_hash_index_v10.patch.

Unless we want to wait until that work is committed before doing more review and testing on this.

Thanks,

Jeff

Re: Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Tue, Nov 8, 2016 at 10:56 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Sat, Sep 24, 2016 at 10:00 PM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
>>
>> On Fri, Sep 23, 2016 at 5:34 PM, Amit Kapila <amit.kapila16@gmail.com>
>> wrote:
>> >
>> > I think here I am slightly wrong.  For the full page writes, it do use
>> > RBM_ZERO_AND_LOCK mode to read the page and for such mode we are not
>> > doing page verification check and rather blindly setting the page to
>> > zero and then overwrites it with full page image.  So after my fix,
>> > you will not see the error of checksum failure.  I have a fix ready,
>> > but still doing some more verification.  If everything passes, I will
>> > share the patch in a day or so.
>> >
>>
>> Attached patch fixes the problem, now we do perform full page writes
>> for bitmap pages.  Apart from that, I have rebased the patch based on
>> latest concurrent index patch [1].  I have updated the README as well
>> to reflect the WAL logging related information for different
>> operations.
>>
>> With attached patch, all the review comments or issues found till now
>> are addressed.
>
>
> This needs to be updated to apply over concurrent_hash_index_v10.patch.
>
> Unless we want to wait until that work is committed before doing more review
> and testing on this.
>

The concurrent hash index patch is still changing, and some of those
changes require corresponding changes in this patch as well.  So, I
think after it gets somewhat stabilized, I will update this patch as
well.  I am not sure it is a good idea to update it for every version
of the hash index patch.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Wed, Nov 9, 2016 at 7:40 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Nov 8, 2016 at 10:56 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>
>> Unless we want to wait until that work is committed before doing more review
>> and testing on this.
>>
>
> The concurrent hash index patch is getting changed and some of the
> changes needs change in this patch as well.  So, I think after it gets
> somewhat stabilized, I will update this patch as well.
>

Now that the concurrent hash index patch is committed [1], I will work
on rebasing this patch.  Note that I have moved this to the next CF.


[1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=6d46f4783efe457f74816a75173eb23ed8930020

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Write Ahead Logging for Hash Indexes

From
Robert Haas
Date:
On Thu, Dec 1, 2016 at 1:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Nov 9, 2016 at 7:40 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Tue, Nov 8, 2016 at 10:56 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>> Unless we want to wait until that work is committed before doing more review
>>> and testing on this.
>>
>> The concurrent hash index patch is getting changed and some of the
>> changes needs change in this patch as well.  So, I think after it gets
>> somewhat stabilized, I will update this patch as well.
>
> Now that concurrent hash index patch is committed [1], I will work on
> rebasing this patch.  Note, I have moved this to next CF.

Thanks.  I am thinking that it might make sense to try to get the
"microvacuum support for hash index" and "cache hash index meta page"
patches committed before this one, because I'm guessing they are much
simpler than this one, and I think therefore that the review of those
patches can probably move fairly quickly.  Of course, ideally I can
also start reviewing this one in the meantime.  Does that make sense
to you?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Thu, Dec 1, 2016 at 9:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Dec 1, 2016 at 1:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Wed, Nov 9, 2016 at 7:40 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> On Tue, Nov 8, 2016 at 10:56 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>>> Unless we want to wait until that work is committed before doing more review
>>>> and testing on this.
>>>
>>> The concurrent hash index patch is getting changed and some of the
>>> changes needs change in this patch as well.  So, I think after it gets
>>> somewhat stabilized, I will update this patch as well.
>>
>> Now that concurrent hash index patch is committed [1], I will work on
>> rebasing this patch.  Note, I have moved this to next CF.
>
> Thanks.  I am thinking that it might make sense to try to get the
> "microvacuum support for hash index" and "cache hash index meta page"
> patches committed before this one, because I'm guessing they are much
> simpler than this one, and I think therefore that the review of those
> patches can probably move fairly quickly.
>

I think it makes sense to move "cache hash index meta page" first;
however, "microvacuum support for hash index" is based on the WAL
patch, as the action in that patch (the delete op) also needs to be
logged.  One idea could be to try to split the patch so that the WAL
logging can be done as a separate patch, but I am not sure it is worth
it.

>  Of course, ideally I can
> also start reviewing this one in the meantime.  Does that make sense
> to you?
>

You can start reviewing some of the operations like "Create Index"
and "Insert".  However, some changes are required because of the
change in the locking strategy for Vacuum.  I am planning to work on
rebasing it and making the required changes next week.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Write Ahead Logging for Hash Indexes

From
Robert Haas
Date:
On Thu, Dec 1, 2016 at 6:51 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Thanks.  I am thinking that it might make sense to try to get the
>> "microvacuum support for hash index" and "cache hash index meta page"
>> patches committed before this one, because I'm guessing they are much
>> simpler than this one, and I think therefore that the review of those
>> patches can probably move fairly quickly.
>
> I think it makes sense to move "cache hash index meta page" first,
> however "microvacuum support for hash index" is based on WAL patch as
> the action in this patch (delete op) also needs to be logged.  One
> idea could be that we can try to split the patch so that WAL logging
> can be done as a separate patch, but I am not sure if it is worth.

The thing is, there's a fair amount of locking stupidity in what just
got committed because of the requirement that the TID doesn't decrease
within a page.  I'd really like to get that fixed.

>>  Of course, ideally I can
>> also start reviewing this one in the meantime.  Does that make sense
>> to you?
>>
>
> You can start reviewing some of the operations like "Create Index",
> "Insert".  However, some changes are required because of change in
> locking strategy for Vacuum.  I am planning to work on rebasing it and
> making required changes in next week.

I'll review after that, since I have other things to review meanwhile.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Fri, Dec 2, 2016 at 7:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Dec 1, 2016 at 6:51 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>  Of course, ideally I can
>>> also start reviewing this one in the meantime.  Does that make sense
>>> to you?
>>>
>>
>> You can start reviewing some of the operations like "Create Index",
>> "Insert".  However, some changes are required because of change in
>> locking strategy for Vacuum.  I am planning to work on rebasing it and
>> making required changes in next week.
>
> I'll review after that, since I have other things to review meanwhile.
>

Please find the rebased patch attached to this e-mail.  There is no
fundamental change in the patch except for adapting the new locking
strategy in the squeeze operation.  I have done sanity testing of the
patch with a master-standby setup, and Kuntal has helped me verify it
with his wal-consistency checker patch.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Robert Haas
Date:
On Mon, Dec 5, 2016 at 2:46 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I'll review after that, since I have other things to review meanwhile.
>
> Attached, please find the rebased patch attached with this e-mail.
> There is no fundamental change in patch except for adapting the new
> locking strategy in squeeze operation.  I have done the sanity testing
> of the patch with master-standby setup and Kuntal has helped me to
> verify it with his wal-consistency checker patch.

This patch again needs a rebase, but before you do that I'd like to
make it harder by applying the attached patch to remove
_hash_chgbufaccess(), which I think is a bad plan for more or less the
same reasons that motivated the removal of _hash_wrtbuf() in commit
25216c98938495fd741bf585dcbef45b3a9ffd40. I think there's probably
more simplification and cleanup that can be done afterward in the wake
of this; what I've done here is just a mechanical replacement of
_hash_chgbufaccess() with LockBuffer() and/or MarkBufferDirty().  The
point is that having MarkBufferDirty() calls happen implicitly inside
some other function is not what we want for write-ahead logging.  Your
patch gets rid of that, too; this is just doing it somewhat more
thoroughly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Attachment

Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Thu, Dec 22, 2016 at 9:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Dec 5, 2016 at 2:46 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> I'll review after that, since I have other things to review meanwhile.
>>
>> Attached, please find the rebased patch attached with this e-mail.
>> There is no fundamental change in patch except for adapting the new
>> locking strategy in squeeze operation.  I have done the sanity testing
>> of the patch with master-standby setup and Kuntal has helped me to
>> verify it with his wal-consistency checker patch.
>
> This patch again needs a rebase,
>

I had completed it last night and left it running night-long tests to
check the sanity of the patch, but I guess now I need another rebase.
Anyway, I feel this is all for the betterment of the final patch.

> but before you do that I'd like to
> make it harder by applying the attached patch to remove
> _hash_chgbufaccess(), which I think is a bad plan for more or less the
> same reasons that motivated the removal of _hash_wrtbuf() in commit
> 25216c98938495fd741bf585dcbef45b3a9ffd40. I think there's probably
> more simplification and cleanup that can be done afterward in the wake
> of this; what I've done here is just a mechanical replacement of
> _hash_chgbufaccess() with LockBuffer() and/or MarkBufferDirty().

The patch replaces the usage of HASH_READ/HASH_WRITE with
BUFFER_LOCK_SHARE/BUFFER_LOCK_EXCLUSIVE, which will leave the hash
code using two types of defines for the same purpose.  We could say
that it is better to replace HASH_READ/HASH_WRITE in the whole hash
index code, or maybe the usage this patch introduces is okay; however,
it seems like btree uses similar terminology (BT_READ/BT_WRITE).
Other than that, your patch looks good.
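
For reference, both sets of defines resolve to the same buffer lock
modes (quoting the headers from memory), so this is mostly a
readability question:

    /* src/include/access/hash.h */
    #define HASH_READ       BUFFER_LOCK_SHARE
    #define HASH_WRITE      BUFFER_LOCK_EXCLUSIVE

    /* src/include/access/nbtree.h */
    #define BT_READ         BUFFER_LOCK_SHARE
    #define BT_WRITE        BUFFER_LOCK_EXCLUSIVE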


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Fri, Dec 23, 2016 at 6:40 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Dec 22, 2016 at 9:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Mon, Dec 5, 2016 at 2:46 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>> I'll review after that, since I have other things to review meanwhile.
>>>
>>> Attached, please find the rebased patch attached with this e-mail.
>>> There is no fundamental change in patch except for adapting the new
>>> locking strategy in squeeze operation.  I have done the sanity testing
>>> of the patch with master-standby setup and Kuntal has helped me to
>>> verify it with his wal-consistency checker patch.
>>
>> This patch again needs a rebase,
>>
>
> I had completed it yesterday night and kept it for night long tests to
> check the sanity of the patch, but I guess now I need another rebase.
>

After recent commits 7819ba1e and 25216c98, this patch requires a
rebase.  Attached is the rebased patch.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Jesper Pedersen
Date:
On 12/27/2016 01:58 AM, Amit Kapila wrote:
> After recent commit's 7819ba1e and 25216c98, this patch requires a
> rebase.  Attached is the rebased patch.
>

This needs a rebase after commit e898437.

Best regards, Jesper




Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Fri, Jan 13, 2017 at 1:04 AM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
> On 12/27/2016 01:58 AM, Amit Kapila wrote:
>>
>> After recent commit's 7819ba1e and 25216c98, this patch requires a
>> rebase.  Attached is the rebased patch.
>>
>
> This needs a rebase after commit e898437.
>

Attached find the rebased patch.

Thanks!


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Michael Paquier
Date:
On Fri, Jan 13, 2017 at 12:23 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Jan 13, 2017 at 1:04 AM, Jesper Pedersen
> <jesper.pedersen@redhat.com> wrote:
>> On 12/27/2016 01:58 AM, Amit Kapila wrote:
>>>
>>> After recent commit's 7819ba1e and 25216c98, this patch requires a
>>> rebase.  Attached is the rebased patch.
>>>
>>
>> This needs a rebase after commit e898437.
>>
>
> Attached find the rebased patch.

The patch applies, and there have been no reviews since the last
version has been submitted. So moved to CF 2017-03.
-- 
Michael



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Jeff Janes
Date:
On Thu, Jan 12, 2017 at 7:23 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Jan 13, 2017 at 1:04 AM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
> On 12/27/2016 01:58 AM, Amit Kapila wrote:
>>
>> After recent commit's 7819ba1e and 25216c98, this patch requires a
>> rebase.  Attached is the rebased patch.
>>
>
> This needs a rebase after commit e898437.
>

Attached find the rebased patch.

Thanks!

I've put this through a lot of crash-recovery testing using different variations of my previously posted testing harness, and have not had any problems.

I occasionally get deadlocks which are properly detected.  I think this is just a user-space problem, not a bug, but I will describe it anyway just in case...

What happens is that connections occasionally create 10,000 nuisance tuples all using the same randomly chosen negative integer index value (the one the hash index is on), and then some time later delete those tuples using the memorized negative index value, to force page splits and squeezes to get exercised.  If two connections happen to choose the same negative index for their nuisance tuples, and then try to delete "their" tuples at the same time, and they are deleting the same 20,000 "nuisance" tuples concurrently with a bucket split/squeeze/something, they might see the tuples in a different order from each other and so deadlock against each other waiting on each other's transaction locks due to "locked" tuples.  Is that a plausible explanation?

Cheers,

Jeff

Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Sat, Feb 4, 2017 at 4:37 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Thu, Jan 12, 2017 at 7:23 PM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
>>
>> On Fri, Jan 13, 2017 at 1:04 AM, Jesper Pedersen
>> <jesper.pedersen@redhat.com> wrote:
>> > On 12/27/2016 01:58 AM, Amit Kapila wrote:
>> >>
>> >> After recent commit's 7819ba1e and 25216c98, this patch requires a
>> >> rebase.  Attached is the rebased patch.
>> >>
>> >
>> > This needs a rebase after commit e898437.
>> >
>>
>> Attached find the rebased patch.
>>
>> Thanks!
>
>
> I've put this through a lot of crash-recovery testing using different
> variations of my previously posted testing harness, and have not had any
> problems.
>

Thanks for detailed testing of this patch.

> I occasionally get deadlocks which are properly detected.  I think this is
> just a user-space problem, not a bug, but will describe it anyway just in
> case...
>
> What happens is that connections occasionally create 10,000 nuisance tuples
> all using the same randomly chosen negative integer index value (the one the
> hash index is on), and then some time later delete those tuples using the
> memorized negative index value, to force the page split and squeeze to get
> exercised.  If two connections happen to choose the same negative index for
> their nuisance tuples, and then try to delete "their" tuples at the same
> time, and they are deleting the same 20,000 "nuisance" tuples concurrently
> with a bucket split/squeeze/something, they might see the tuples in a
> different order from each other and so deadlock against each other waiting
> on each others transaction locks due to "locked" tuples.  Is that a
> plausible explanation?
>

Yes, that is the right explanation for the deadlock you are seeing.  A
few days back Jesper and Ashutosh Sharma also faced the same problem
[1] with a similar test (non-unique values in a hash index).  To
explain briefly how this can happen: the squeeze operation moves
tuples from the far end of the bucket to the rear end, and it is quite
possible that different backends see tuples with the same values in a
different order.


[1] - https://www.postgresql.org/message-id/CAE9k0PmBGrNO_s%2ByKn72VO--ypaXWemOsizsWr5cHjEv0X4ERg%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Robert Haas
Date:
On Thu, Jan 12, 2017 at 10:23 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Jan 13, 2017 at 1:04 AM, Jesper Pedersen
> <jesper.pedersen@redhat.com> wrote:
>> On 12/27/2016 01:58 AM, Amit Kapila wrote:
>>> After recent commit's 7819ba1e and 25216c98, this patch requires a
>>> rebase.  Attached is the rebased patch.
>>
>> This needs a rebase after commit e898437.
>
> Attached find the rebased patch.

Well, I've managed to break this again by committing more things.

I think it would be helpful if you broke this into a series of
preliminary refactoring patches and then a final patch that actually
adds WAL-logging. The things that look like preliminary refactoring to
me are:

- Adding _hash_pgaddmultitup and using it in various places.
- Adding and freeing overflow pages has been extensively reworked.
- Similarly, there is some refactoring of how bitmap pages get initialized.
- Index initialization has been rejiggered significantly.
- Bucket splits have been rejiggered.

Individually those changes don't look all that troublesome to review,
but altogether it's quite a lot.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Thu, Feb 9, 2017 at 10:28 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jan 12, 2017 at 10:23 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Fri, Jan 13, 2017 at 1:04 AM, Jesper Pedersen
>> <jesper.pedersen@redhat.com> wrote:
>>> On 12/27/2016 01:58 AM, Amit Kapila wrote:
>>>> After recent commit's 7819ba1e and 25216c98, this patch requires a
>>>> rebase.  Attached is the rebased patch.
>>>
>>> This needs a rebase after commit e898437.
>>
>> Attached find the rebased patch.
>
> Well, I've managed to break this again by committing more things.
>

I have rebased it again; this time it requires a few more changes due
to the previous commits.  As we have to store the hashm_maxbucket
number in the prevblock number of the primary bucket page, we need to
take care of the same during replay wherever applicable.  Also, there
were a few new tests in contrib modules which were failing.  The
attached patch is still a rebased version of the previous patch; I
will work on splitting it as per your suggestion below, but I thought
it is better to post the rebased patch as well.
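
To illustrate what "take care of the same during replay" means: the
redo routines that (re)initialize a primary bucket page now have to
fill hasho_prevblkno with the maxbucket value carried in the WAL
record rather than a real block number.  Something along these lines
(an illustrative sketch only; the WAL record field name here is made
up):

    /* illustrative: setting up a primary bucket page's opaque area in redo */
    HashPageOpaque pageopaque;

    pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
    pageopaque->hasho_prevblkno = xlrec->max_bucket;    /* hypothetical record field */
    pageopaque->hasho_nextblkno = InvalidBlockNumber;
    pageopaque->hasho_bucket = bucket;
    pageopaque->hasho_flag = LH_BUCKET_PAGE;
    pageopaque->hasho_page_id = HASHO_PAGE_ID;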

> I think it would be helpful if you broke this into a series of
> preliminary refactoring patches and then a final patch that actually
> adds WAL-logging.

Okay, that makes sense to me as well.

> The things that look like preliminary refactoring to
> me are:
>
> - Adding _hash_pgaddmultitup and using it in various places.
> - Adding and freeing overflow pages has been extensively reworked.
>

Freeing the overflow page is too tightly coupled with changes related
to _hash_pgaddmultitup, so it might be better to keep it along with
it.  However, I think we can prepare a separate patch for changes
related to adding the overflow page.

> - Similarly, there is some refactoring of how bitmap pages get initialized.
> - Index initialization has been rejiggered significantly.
> - Bucket splits have been rejiggered.
>

Will try to prepare separate patches for the above three.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Fri, Feb 10, 2017 at 12:15 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Feb 9, 2017 at 10:28 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
>> The things that look like preliminary refactoring to
>> me are:
>>
>> - Adding _hash_pgaddmultitup and using it in various places.
>> - Adding and freeing overflow pages has been extensively reworked.
>>
>
> Freeing the overflow page is too tightly coupled with changes related
> to _hash_pgaddmultitup, so it might be better to keep it along with
> it.  However, I think we can prepare a separate patch for changes
> related to adding the overflow page.
>
>> - Similarly, there is some refactoring of how bitmap pages get initialized.
>> - Index initialization has been rejiggered significantly.
>> - Bucket splits have been rejiggered.
>>
>

As discussed, attached are refactoring patches and a patch to enable
WAL for the hash index on top of them.



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Kuntal Ghosh
Date:
On Mon, Feb 13, 2017 at 8:52 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> As discussed, attached are refactoring patches and a patch to enable
> WAL for the hash index on top of them.

0006-Enable-WAL-for-Hash-Indexes.patch needs to be rebased after
commit 8da9a226369e9ceec7cef1.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Robert Haas
Date:
On Mon, Feb 13, 2017 at 10:22 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> As discussed, attached are refactoring patches and a patch to enable
> WAL for the hash index on top of them.

Thanks.  I think that the refactoring patches shouldn't add
START_CRIT_SECTION() and END_CRIT_SECTION() calls; let's leave that
for the final patch.  Nor should their comments reference the idea of
writing WAL; that should also be for the final patch.

PageGetFreeSpaceForMulTups doesn't seem like a great name.
PageGetFreeSpaceForMultipleTuples?  Or maybe just use
PageGetExactFreeSpace and then do the accounting in the caller.  I'm not
sure it's really worth having a function just to subtract a multiple
of sizeof(ItemIdData), and it would actually be more efficient to have
the caller take care of this, since you wouldn't need to keep
recalculating the value for every iteration of the loop.
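
Something like this in the caller would capture that idea (just a
sketch of the suggestion, not code from the patch):

    /* track the remaining space once, instead of recomputing it per tuple */
    Size        freespace = PageGetExactFreeSpace(page);
    int         i;

    for (i = 0; i < nitups; i++)
    {
        Size        needed = MAXALIGN(IndexTupleSize(itups[i])) + sizeof(ItemIdData);

        if (freespace < needed)
            break;              /* out of room; move on to the next write page */

        freespace -= needed;
        /* ... add itups[i] to the page ... */
    }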

I think we ought to just rip (as a preliminary patch) out the support
for freeing an overflow page in the middle of a bucket chain.  That
code is impossible to hit, I believe, and instead of complicating the
WAL machinery to cater to a nonexistent case, maybe we ought to just
get rid of it.

+                               /*
+                                * We need to release and if required
reacquire the lock on
+                                * rbuf to ensure that standby
shouldn't see an intermediate
+                                * state of it.
+                                */

How does releasing and reacquiring the lock on the master affect
whether the standby can see an intermediate state?  That sounds
totally wrong to me (and also doesn't belong in a refactoring patch
anyway since we're not emitting any WAL here).

-       Assert(PageIsNew(page));

This worries me a bit.  What's going on here?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Thu, Feb 16, 2017 at 7:15 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Feb 13, 2017 at 10:22 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> As discussed, attached are refactoring patches and a patch to enable
>> WAL for the hash index on top of them.
>
> Thanks.  I think that the refactoring patches shouldn't add
> START_CRIT_SECTION() and END_CRIT_SECTION() calls; let's leave that
> for the final patch.  Nor should their comments reference the idea of
> writing WAL; that should also be for the final patch.
>

Okay, changed as per suggestion.

> PageGetFreeSpaceForMulTups doesn't seem like a great name.
> PageGetFreeSpaceForMultipleTuples?

Okay, changed as per suggestion.

>  Or maybe just use
> PageGetExactFreeSpace and then do the account in the caller.  I'm not
> sure it's really worth having a function just to subtract a multiple
> of sizeof(ItemIdData), and it would actually be more efficient to have
> the caller take care of this, since you wouldn't need to keep
> recalculating the value for every iteration of the loop.
>

I have tried this, but I think even if we track it outside, we need
something like PageGetFreeSpaceForMultipleTuples for the cases when we
have to try the next write page(s) where the data (set of index
tuples) can be accommodated.

> I think we ought to just rip (as a preliminary patch) out the support
> for freeing an overflow page in the middle of a bucket chain.  That
> code is impossible to hit, I believe, and instead of complicating the
> WAL machinery to cater to a nonexistent case, maybe we ought to just
> get rid of it.
>

I also could not think of a case where we need to care about a page
in the middle of the bucket chain as per its current usage.  In
particular, are you worried about the code related to the nextblock
number in _hash_freeovflpage()?  Surely, we can remove it, but what
complexity are you referring to here?  There is some additional
book-keeping for the primary bucket page, but there is nothing much
about maintaining a backward link.  One more thing to note is that
this is an exposed API, so to avoid it being used in some wrong way,
we might want to make it static.

> +                               /*
> +                                * We need to release and if required
> reacquire the lock on
> +                                * rbuf to ensure that standby
> shouldn't see an intermediate
> +                                * state of it.
> +                                */
>
> How does releasing and reacquiring the lock on the master affect
> whether the standby can see an intermediate state?

I am talking about the intermediate state of the standby with respect
to the master.  We will release the lock on the standby after
replaying the WAL for moving tuples from the read page to the write
page, whereas the master will hold it for a much longer period (till
we move all the tuples from the read page and free that page).  I
understand that there is no correctness issue here, but I was trying
to make the operations on master and standby similar, and another
thing is that it will help us obey the coding rule of releasing the
lock on buffers after writing the WAL record.  I see no harm in
maintaining the existing coding pattern; however, if you think that is
not important in this case, then I can change the code to avoid
releasing the lock on the read page.

>  That sounds
> totally wrong to me (and also doesn't belong in a refactoring patch
> anyway since we're not emitting any WAL here).
>

Agreed that even if we want to make such a change, it should be part
of the WAL patch.  So for now, I have reverted that change.

> -       Assert(PageIsNew(page));
>
> This worries me a bit.  What's going on here?
>

This is related to the change below.

+ /*
+  * Initialise the freed overflow page, here we can't complete zeroed the
+  * page as WAL replay routines expect pages to be initialized. See
+  * explanation of RBM_NORMAL mode atop XLogReadBufferExtended.
+  */
+ _hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf));

Basically, we need this routine to initialize freed overflow pages.
Also, if you see the similar function in btree (_bt_pageinit(Page
page, Size size)), it doesn't have any such Assertion.
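
For comparison, both helpers are (if I remember correctly) just thin
wrappers over PageInit(), which is also why an Assert(PageIsNew(page))
would be too strict once we start reusing freed overflow pages:

    /* roughly, from memory -- not copied verbatim from the tree */
    void
    _bt_pageinit(Page page, Size size)
    {
        PageInit(page, size, sizeof(BTPageOpaqueData));
    }

    void
    _hash_pageinit(Page page, Size size)
    {
        PageInit(page, size, sizeof(HashPageOpaqueData));
    }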

Attached are refactoring patches.  The WAL patch needs some changes
based on the above comments, so I will post it later.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Thu, Feb 16, 2017 at 8:16 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Feb 16, 2017 at 7:15 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> Attached are refactoring patches.  WAL patch needs some changes based
> on above comments, so will post it later.
>

Attached is a rebased patch to enable WAL logging for hash indexes.
Note that this needs to be applied on top of the refactoring patches
[1] sent in the previous e-mail.

[1]  - https://www.postgresql.org/message-id/CAA4eK1JCk-abtsVMMP8xqGZktUH73ZmLaZ6b_%2B-oCRtRkdqPrQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Robert Haas
Date:
On Thu, Feb 16, 2017 at 8:16 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Attached are refactoring patches.  WAL patch needs some changes based
> on above comments, so will post it later.

After some study, I have committed 0001, and also committed 0002 and
0003 as a single commit, with only minor tweaks.

I haven't made a full study of 0004 yet, but I'm a bit concerned about
a couple of the hunks, specifically this one:

@@ -985,9 +962,9 @@ _hash_splitbucket_guts(Relation rel,

                 if (PageGetFreeSpace(npage) < itemsz)
                 {
-                    /* write out nbuf and drop lock, but keep pin */
-                    MarkBufferDirty(nbuf);
+                    /* drop lock, but keep pin */
                     LockBuffer(nbuf, BUFFER_LOCK_UNLOCK);
+                    /* chain to a new overflow page */
                     nbuf = _hash_addovflpage(rel, metabuf, nbuf,
                                              (nbuf == bucket_nbuf) ? true : false);
                     npage = BufferGetPage(nbuf);

And also this one:

@@ -1041,17 +1024,6 @@ _hash_splitbucket_guts(Relation rel,
      * To avoid deadlocks due to locking order of buckets, first lock the old
      * bucket and then the new bucket.
      */
-    if (nbuf == bucket_nbuf)
-    {
-        MarkBufferDirty(bucket_nbuf);
-        LockBuffer(bucket_nbuf, BUFFER_LOCK_UNLOCK);
-    }
-    else
-    {
-        MarkBufferDirty(nbuf);
-        _hash_relbuf(rel, nbuf);
-    }
-
     LockBuffer(bucket_obuf, BUFFER_LOCK_EXCLUSIVE);
     opage = BufferGetPage(bucket_obuf);
     oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
 

I haven't quite grasped what's going on here, but it looks like those
MarkBufferDirty() calls didn't move someplace else, but rather just
vanished.  That would seem to be not good.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Mon, Feb 27, 2017 at 11:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Feb 16, 2017 at 8:16 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Attached are refactoring patches.  WAL patch needs some changes based
>> on above comments, so will post it later.
>
> After some study, I have committed 0001, and also committed 0002 and
> 0003 as a single commit, with only minor tweaks.
>
> I haven't made a full study of 0004 yet, but I'm a bit concerned about
> a couple of the hunks, specifically this one:
>
> @@ -985,9 +962,9 @@ _hash_splitbucket_guts(Relation rel,
>
>                  if (PageGetFreeSpace(npage) < itemsz)
>                  {
> -                    /* write out nbuf and drop lock, but keep pin */
> -                    MarkBufferDirty(nbuf);
> +                    /* drop lock, but keep pin */
>                      LockBuffer(nbuf, BUFFER_LOCK_UNLOCK);
> +
>                      /* chain to a new overflow page */
>                      nbuf = _hash_addovflpage(rel, metabuf, nbuf,
> (nbuf == bucket_nbuf) ? true : false);
>                      npage = BufferGetPage(nbuf);
>
> And also this one:
>
> @@ -1041,17 +1024,6 @@ _hash_splitbucket_guts(Relation rel,
>       * To avoid deadlocks due to locking order of buckets, first lock the old
>       * bucket and then the new bucket.
>       */
> -    if (nbuf == bucket_nbuf)
> -    {
> -        MarkBufferDirty(bucket_nbuf);
> -        LockBuffer(bucket_nbuf, BUFFER_LOCK_UNLOCK);
> -    }
> -    else
> -    {
> -        MarkBufferDirty(nbuf);
> -        _hash_relbuf(rel, nbuf);
> -    }
> -
>      LockBuffer(bucket_obuf, BUFFER_LOCK_EXCLUSIVE);
>      opage = BufferGetPage(bucket_obuf);
>      oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
>
> I haven't quite grasped what's going on here, but it looks like those
> MarkBufferDirty() calls didn't move someplace else, but rather just
> vanished.  That would seem to be not good.
>

Yeah, actually those were added later in the Enable-WAL-for-Hash*
patch, but since this patch is standalone, we should not remove them
from their existing usage.  I have added those back and rebased the
remaining patches.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Robert Haas
Date:
On Tue, Feb 28, 2017 at 7:31 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Yeah, actually those were added later in Enable-WAL-for-Hash* patch,
> but I think as this patch is standalone, so we should not remove it
> from their existing usage, I have added those back and rebased the
> remaining patches.

Great, thanks.  0001 looks good to me now, so committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Robert Haas
Date:
On Wed, Mar 1, 2017 at 4:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Feb 28, 2017 at 7:31 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Yeah, actually those were added later in Enable-WAL-for-Hash* patch,
>> but I think as this patch is standalone, so we should not remove it
>> from their existing usage, I have added those back and rebased the
>> remaining patches.
>
> Great, thanks.  0001 looks good to me now, so committed.

Committed 0002.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Robert Haas
Date:
On Tue, Mar 7, 2017 at 5:08 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Mar 1, 2017 at 4:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Feb 28, 2017 at 7:31 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> Yeah, actually those were added later in Enable-WAL-for-Hash* patch,
>>> but I think as this patch is standalone, so we should not remove it
>>> from their existing usage, I have added those back and rebased the
>>> remaining patches.
>>
>> Great, thanks.  0001 looks good to me now, so committed.
>
> Committed 0002.

Here are some initial review thoughts on 0003 based on a first read-through.

+         affected by this setting.  Example include system catalogs.  For

Not grammatical.  Perhaps "affected by this setting, such as system catalogs".

-    mark meta page dirty and release buffer content lock and pin
+     mark meta page dirty and write WAL for update of metapage
+     release buffer content lock and pin

Was the change in the indentation level intentional?

+accomodated in the initial set of buckets and it is not unusual to have large

Spelling.

+As deletion involves multiple atomic operations, it is quite possible that
+system crashes after (a) removing tuples from some of the bucket pages
+(b) before clearing the garbage flag (c) before updating the metapage.  If the

Need two commas and an "or": (a) removing tuples from some of the
bucket pages, (b) before clearing the garbage flag, or (c)

+quite possible, that system crashes before completing the operation on

No comma.  Also, "that the system crashes".

+the index will remain bloated and can impact performance of read and

and this can impact

+    /*
+     * we need to release and reacquire the lock on overflow buffer to ensure
+     * that standby shouldn't see an intermediate state of it.
+     */
+    LockBuffer(ovflbuf, BUFFER_LOCK_UNLOCK);
+    LockBuffer(ovflbuf, BUFFER_LOCK_EXCLUSIVE);

I still think this is a bad idea.  Releasing and reacquiring the lock
on the master doesn't prevent the standby from seeing intermediate
states; the comment, if I understand correctly, is just plain wrong.
What it does do is make the code on the master more complicated, since
now the page returned by _hash_addovflpage might theoretically fill up
while you are busy releasing and reacquiring the lock.

+ *    Add the tuples (itups) to wbuf in this function, we could do that in the
+ *    caller as well.  The advantage of doing it here is we can easily write

Change comma to period.  Then: "We could easily do that in the caller,
but the advantage..."

+     * Initialise the freed overflow page, here we can't complete zeroed the

Don't use British spelling, and don't use a comma to join two
sentences.  Also "here we can't complete zeroed" doesn't seem right; I
don't know what that means.

+         * To replay meta page changes, we can log the entire metapage which
+         * doesn't seem advisable considering size of hash metapage or we can
+         * log the individual updated values which seems doable, but we prefer
+         * to perform exact operations on metapage during replay as are done
+         * during actual operation.  That looks straight forward and has an
+         * advantage of using lesser space in WAL.

This sounds like it is describing two alternatives that were not
selected and then the one that was selected, but I don't understand
the difference between the second non-selected alternative and what
was actually picked.  I'd just write "We need only log the one field
from the metapage that has changed." or maybe drop the comment
altogether.

+                        XLogEnsureRecordSpace(0, XLR_NORMAL_RDATAS + nitups);

The use of XLR_NORMAL_RDATAS here seems wrong.  Presumably you should
pass the number of rdatas you will actually need, which is some
constant plus nitups.  I think you should just write that constant
here directly, whether or not it happens to be equal to
XLR_NORMAL_RDATAS.

+                /*
+                 * We need to release and if required reacquire the lock on
+                 * rbuf to ensure that standby shouldn't see an intermediate
+                 * state of it.  If we don't release the lock, after replay of
+                 * XLOG_HASH_MOVE_PAGE_CONTENTS on standby users will
be able to
+                 * view the results of partial deletion on rblkno.
+                 */
+                LockBuffer(rbuf, BUFFER_LOCK_UNLOCK);

If you DO release the lock, this will STILL be true, because what
matters on the standby is what the redo code does.  This code doesn't
even execute on the standby, so how could it possibly affect what
intermediate states are visible there?

+        /*
+         * Here, we could have copied the entire metapage as well to WAL
+         * record and restore it as it is during replay.  However the size of
+         * metapage is not small (more than 600 bytes), so recording just the
+         * information required to construct metapage.
+         */

Again, this seems sorta obvious.  You've got a bunch of these comments
that say similar things.  I'd either make them shorter or just remove
them.

+         * We can log the exact changes made to meta page, however as no
+         * concurrent operation could see the index during the replay of this
+         * record, we can perform the operations during replay as they are
+         * done here.

Don't use a comma to join two sentences.  Also, I don't understand
anything up to the "however".  I think you mean something like "We
don't need hash index creation to be atomic from the point of view of
replay, because until the creating transaction commits it's not
visible to anything on the standby anyway."

+     * Set pd_lower just past the end of the metadata.  This is to log
+     * full_page_image of metapage in xloginsert.c.

Why the underscores?

+         * won't be a leak on standby, as next split will consume this space.

The master and the standby had better be the same, so I don't know
what it means to talk about a leak on the standby.

+                 * Change the shared buffer state in critical section,
+                 * otherwise any error could make it unrecoverable after
+                 * recovery.

Uh, what?  We only recover things during recovery.  Nothing gets
recovered after recovery.

Maybe this could get some tests, via the isolation tester, for things
like snapshot too old?

I haven't gone through all of this in detail yet; I think most of it
is right on target.  But there is a lot more to look at yet.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Wed, Mar 8, 2017 at 3:38 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Mar 1, 2017 at 4:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Feb 28, 2017 at 7:31 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> Yeah, actually those were added later in Enable-WAL-for-Hash* patch,
>>> but I think as this patch is standalone, so we should not remove it
>>> from their existing usage, I have added those back and rebased the
>>> remaining patches.
>>
>> Great, thanks.  0001 looks good to me now, so committed.
>
> Committed 0002.
>

Thanks.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Wed, Mar 8, 2017 at 5:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> Here are some initial review thoughts on 0003 based on a first read-through.
>
>
> +    /*
> +     * we need to release and reacquire the lock on overflow buffer to ensure
> +     * that standby shouldn't see an intermediate state of it.
> +     */
> +    LockBuffer(ovflbuf, BUFFER_LOCK_UNLOCK);
> +    LockBuffer(ovflbuf, BUFFER_LOCK_EXCLUSIVE);
>
> I still think this is a bad idea.  Releasing and reacquiring the lock
> on the master doesn't prevent the standby from seeing intermediate
> states; the comment, if I understand correctly, is just plain wrong.
>

I can remove this release-and-reacquire stuff, but first, let me try
to explain what intermediate state I am talking about here.  If we
don't have a release/reacquire of the lock here, the standby could see
an empty overflow page at that location, whereas the master will be
able to see the new page only after the insertion of tuple(s) in that
bucket is finished.  I understand that there is not much functional
impact from this change, but I still wanted to make you understand why
I have added this code.  I am okay with removing this code if you
still feel this is just a marginal case and we should not bother about
it.  Let me know what you think.

>
> +     * Initialise the freed overflow page, here we can't complete zeroed the
>
> Don't use British spelling, and don't use a comma to join two
> sentences.  Also "here we can't complete zeroed" doesn't seem right; I
> don't know what that means.

What the comment means to say is that we can't initialize the page as
zero (something like the below):

MemSet(ovflpage, 0, BufferGetPageSize(ovflbuf));

There is also a typo in that sentence: /complete/completely.

>
> +         * To replay meta page changes, we can log the entire metapage which
> +         * doesn't seem advisable considering size of hash metapage or we can
> +         * log the individual updated values which seems doable, but we prefer
> +         * to perform exact operations on metapage during replay as are done
> +         * during actual operation.  That looks straight forward and has an
> +         * advantage of using lesser space in WAL.
>
> This sounds like it is describing two alternatives that were not
> selected and then the one that was selected, but I don't understand
> the difference between the second non-selected alternative and what
> was actually picked.
>

Actually, the second alternative was picked.

> I'd just write "We need only log the one field
> from the metapage that has changed." or maybe drop the comment
> altogether.
>

I have added the comment so that others can understand why we don't
want to log the entire meta page (I think in some other indexes, we
always log the entire meta page, so such a question can arise).
However, after reading your comment, I think it is overkill and I
prefer to drop the entire comment unless you feel differently.

>
> +                /*
> +                 * We need to release and if required reacquire the lock on
> +                 * rbuf to ensure that standby shouldn't see an intermediate
> +                 * state of it.  If we don't release the lock, after replay of
> +                 * XLOG_HASH_MOVE_PAGE_CONTENTS on standby users will
> be able to
> +                 * view the results of partial deletion on rblkno.
> +                 */
> +                LockBuffer(rbuf, BUFFER_LOCK_UNLOCK);
>
> If you DO release the lock, this will STILL be true, because what
> matters on the standby is what the redo code does.
>

That's right; what I intend to do here is to release the lock on the
master where it will be released on the standby.  In this case, we
can't ensure that what a user can see on the master is the same as on
the standby after this WAL record is replayed, because on the master
we have an exclusive lock on the write buf, so no one can read the
contents of the read buf (the location of the read buf will be after
the write buf), whereas on the standby it will be possible to read the
contents of the read buf.  I think this is not a correctness issue, so
we might want to leave it as it is; what do you say?

>
> +         * We can log the exact changes made to meta page, however as no
> +         * concurrent operation could see the index during the replay of this
> +         * record, we can perform the operations during replay as they are
> +         * done here.
>
> Don't use a comma to join two sentences.  Also, I don't understand
> anything up to the "however".
>

I mean the below changes made to the meta buf (refer to the existing
code):

metap->hashm_mapp[metap->hashm_nmaps] = num_buckets + 1;

metap->hashm_nmaps++;

I think it will be clear if you look at both the DO and REDO
operations of XLOG_HASH_INIT_BITMAP_PAGE.  Let me know if you still
think that the comment needs to be changed.

>
> +     * Set pd_lower just past the end of the metadata.  This is to log
> +     * full_page_image of metapage in xloginsert.c.
>
> Why the underscores?
>
> +         * won't be a leak on standby, as next split will consume this space.
>

No specific reason; I was just trying to resemble full_page_writes,
but I can change it if you feel it doesn't make much sense.

> +         * won't be a leak on standby, as next split will consume this space.

> The master and the standby had better be the same,

That's right.

> so I don't know
> what it means to talk about a leak on the standby.
>

I just want to say that even if we fail after the allocation of
buckets and before the split operation is completed, the newly
allocated buckets will neither be leaked on the master nor on the
standby.  Do you think we should add a comment for the same, or is it
too obvious?  How about changing it to something like:

"Even if we fail after this operation, it won't leak buckets, as next
split will consume this space.  In any case, even without failure, we
don't use all the space in one split operation."

> +                 * Change the shared buffer state in critical section,
> +                 * otherwise any error could make it unrecoverable after
> +                 * recovery.
>
> Uh, what?  We only recover things during recovery.  Nothing gets
> recovered after recovery.
>

Right, how about "Change the shared buffer state in a critical
section, otherwise any error could make the page unrecoverable."

> Maybe this could get some tests, via the isolation tester, for things
> like snapshot too old?
>

Okay, I can try, but note that currently there is no test related to
"snapshot too old" for any other indexes.

> I haven't gone through all of this in detail yet; I think most of it
> is right on target.  But there is a lot more to look at yet.
>

Thanks for your valuable comments.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Robert Haas
Date:
On Wed, Mar 8, 2017 at 7:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I still think this is a bad idea.  Releasing and reacquiring the lock
>> on the master doesn't prevent the standby from seeing intermediate
>> states; the comment, if I understand correctly, is just plain wrong.
> I can remove this release and reacquire stuff, but first, let me try
> to explain what intermediate state I am talking here.  If we don't
> have a release/reacquire lock here, standby could see an empty
> overflow page at that location whereas master will be able to see the
> new page only after the insertion of tuple/'s in that bucket is
> finished.

That's true, but you keep speaking as if the release and reacquire
prevents the standby from seeing that state.  It doesn't.  Quite the
opposite: it allows the master to see that intermediate state as well.
I don't think there is a problem with the standby seeing an
intermediate state that can't be seen on the master, provided we've
set up the code to handle that intermediate state properly, which
hopefully we have.

>> +     * Initialise the freed overflow page, here we can't complete zeroed the
>>
>> Don't use British spelling, and don't use a comma to join two
>> sentences.  Also "here we can't complete zeroed" doesn't seem right; I
>> don't know what that means.
>
> What the comment means to say is that we can't initialize page as zero
> (something like below)
> MemSet(ovflpage, 0, BufferGetPageSize(ovflbuf));
>
> typo in sentence
> /complete/completely

I think you can just say we initialize the page.  Saying we can't
initialize it by zeroing it seems pretty obvious, both from the code
and from the general theory of how stuff works.  If you did want to
explain it, I think you'd write something like "Just zeroing the page
won't work, because <reasons>."

>> +                /*
>> +                 * We need to release and if required reacquire the lock on
>> +                 * rbuf to ensure that standby shouldn't see an intermediate
>> +                 * state of it.  If we don't release the lock, after replay of
>> +                 * XLOG_HASH_MOVE_PAGE_CONTENTS on standby users will
>> be able to
>> +                 * view the results of partial deletion on rblkno.
>> +                 */
>> +                LockBuffer(rbuf, BUFFER_LOCK_UNLOCK);
>>
>> If you DO release the lock, this will STILL be true, because what
>> matters on the standby is what the redo code does.
>
> That's right, what I intend to do here is to release the lock in
> master where it will be released in standby.  In this case, we can't
> ensure what user can see in master is same as in standby after this
> WAL record is replayed, because in master we have exclusive lock on
> write buf, so no one can read the contents of read buf (the location
> of read buf will be after write buf) whereas, in standby, it will be
> possible to read the contents of the read buf.  I think this is not a
> correctness issue, so we might want to leave as it is, what do you
> say?

Well, as in the case above, you're doing extra work to make sure that
every state which can be observed on the standby can also be observed
on the master.  But that has no benefit, and it does have a cost: you
have to release and reacquire a lock that you could otherwise retain,
saving CPU cycles and code complexity.  So I think the way you've got
it is not good.

>> +         * We can log the exact changes made to meta page, however as no
>> +         * concurrent operation could see the index during the replay of this
>> +         * record, we can perform the operations during replay as they are
>> +         * done here.
>>
>> Don't use a comma to join two sentences.  Also, I don't understand
>> anything up to the "however".
>>
>
> I mean the below changes made to meta buf (refer existing code):
> metap->hashm_mapp[metap->hashm_nmaps] = num_buckets + 1;
>
> metap->hashm_nmaps++;
>
> I think it will be clear if you look both the DO and REDO operation of
> XLOG_HASH_INIT_BITMAP_PAGE.  Let me know if you still think that the
> comment needs to be changed?

I think it's not very clear.  I think what this boils down to is that
this is another place where you've got a comment to explain that you
didn't log the entire metapage, but I don't think anyone would expect
that anyway, so it doesn't seem like a very important thing to
explain.  If you want a comment here, maybe something like /* This is
safe only because nobody else can be modifying the index at this
stage; it's only visible to the transaction that is creating it */

>> +     * Set pd_lower just past the end of the metadata.  This is to log
>> +     * full_page_image of metapage in xloginsert.c.
>>
>> Why the underscores?
>>
>> +         * won't be a leak on standby, as next split will consume this space.
>
> No specific reason, just trying to resemble full_page_writes, but can
> change if you feel it doesn't make much sense.

Well, usually I don't separate_words_in_a_sentence_with_underscores.
It makes sense to do that if you are referring to an identifier in the
code with that name, but otherwise not.

>> +         * won't be a leak on standby, as next split will consume this space.
>
>> The master and the standby had better be the same,
>
> That's right.
>
>> so I don't know
>> what it means to talk about a leak on the standby.
>>
>
> I just want to say that even if we fail after allocation of buckets
> and before the split operation is completed, the newly allocated
> buckets will neither be leaked on master nor on standby.  Do you think
> we should add a comment for same or is it too obvious?  How about
> changing it to something like:
>
> "Even if we fail after this operation, it won't leak buckets, as next
> split will consume this space.  In any case, even without failure, we
> don't use all the space in one split operation."

Sure.

>> +                 * Change the shared buffer state in critical section,
>> +                 * otherwise any error could make it unrecoverable after
>> +                 * recovery.
>>
>> Uh, what?  We only recover things during recovery.  Nothing gets
>> recovered after recovery.
>
> Right, how about "Change the shared buffer state in a critical
> section, otherwise any error could make the page unrecoverable."

Sure.

>> Maybe this could get some tests, via the isolation tester, for things
>> like snapshot too old?
>
> Okay, I can try, but note that currently there is no test related to
> "snapshot too old" for any other indexes.

Wow, that's surprising.  It seems the snapshot_too_old test only
checks that this works for a table that has no indexes.  Have you,
anyway, tested it manually?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Robert Haas
Date:
On Tue, Mar 7, 2017 at 6:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> Great, thanks.  0001 looks good to me now, so committed.
>>
>> Committed 0002.
>
> Here are some initial review thoughts on 0003 based on a first read-through.

More thoughts on the main patch:

The text you've added to the hash index README throws around the term
"single WAL entry" pretty freely.  For example: "An insertion that
causes an addition of an overflow page is logged as a single WAL entry
preceded by a WAL entry for new overflow page required to insert a
tuple."  When you say a single WAL entry, you make it sound like
there's only one, but because it's preceded by another WAL entry,
there are actually two.  I would rephrase this to just avoid using the
word "single" this way.  For example: "If an insertion causes the
addition of an overflow page, there will be one WAL entry for the new
overflow page and a second entry for the insert itself."

+mode anyway).  It would seem natural to complete the split in VACUUM, but since
+splitting a bucket might require allocating a new page, it might fail if you
+run out of disk space.  That would be bad during VACUUM - the reason for
+running VACUUM in the first place might be that you run out of disk space,
+and now VACUUM won't finish because you're out of disk space.  In contrast,
+an insertion can require enlarging the physical file anyway.

But VACUUM actually does try to complete the splits, so this is wrong, I think.

+Squeeze operation moves tuples from one of the buckets later in the chain to

A squeeze operation

+As Squeeze operation involves writing multiple atomic operations, it is

As a squeeze operation involves multiple atomic operations

+                if (!xlrec.is_primary_bucket_page)
+                    XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);

So, in hashbucketcleanup, the page is registered and therefore an FPI
will be taken if this is the first reference to this page since the
checkpoint, but hash_xlog_delete won't restore the FPI.  I think that
exposes us to a torn-page hazard.
_hash_freeovflpage/hash_xlog_squeeze_page seems to have the same
disease, as does _hash_squeezebucket/hash_xlog_move_page_contents.
Not sure exactly what the fix should be.

+            /*
+             * As bitmap page doesn't have standard page layout, so this will
+             * allow us to log the data.
+             */
+            XLogRegisterBuffer(2, mapbuf, REGBUF_STANDARD);
+            XLogRegisterBufData(2, (char *) &bitmap_page_bit, sizeof(uint32));

If the page doesn't have a standard page layout, why are we passing
REGBUF_STANDARD?  But I think you've hacked things so that bitmap
pages DO have a standard page layout, per the change to
_hash_initbitmapbuffer, in which case the comment seems wrong.

(The comment is also a little confusing grammatically, but let's iron
out the substantive issue first.)

+        recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_ADD_OVFL_PAGE);
+
+        PageSetLSN(BufferGetPage(ovflbuf), recptr);
+        PageSetLSN(BufferGetPage(buf), recptr);
+
+        if (BufferIsValid(mapbuf))
+            PageSetLSN(BufferGetPage(mapbuf), recptr);
+
+        if (BufferIsValid(newmapbuf))
+            PageSetLSN(BufferGetPage(newmapbuf), recptr);

I think you forgot to call PageSetLSN() on metabuf.  (There are 5
block references in the write-ahead log record but only 4 calls to
PageSetLSN ... surely that can't be right!)
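
Presumably the missing call is just the obvious one alongside the
others:

        PageSetLSN(BufferGetPage(metabuf), recptr);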

+    /*
+     * We need to release the locks once the prev pointer of overflow bucket
+     * and next of left bucket are set, otherwise concurrent read might skip
+     * the bucket.
+     */
+    if (BufferIsValid(leftbuf))
+        UnlockReleaseBuffer(leftbuf);
+    UnlockReleaseBuffer(ovflbuf);

I don't believe the comment.  Postponing the lock release wouldn't
cause the concurrent read to skip the bucket.  It might cause it to be
delayed unnecessarily, but the following comment already addresses
that point.  I think you can just nuke this comment.

+        xlrec.is_prim_bucket_same_wrt = (wbuf == bucketbuf) ? true : false;
+        xlrec.is_prev_bucket_same_wrt = (wbuf == prevbuf) ? true : false;

Maybe just xlrec.is_prim_bucket_same_wrt = (wbuf == bucketbuf); and similar?

+    /*
+     * We release locks on writebuf and bucketbuf at end of replay operation
+     * to ensure that we hold lock on primary bucket page till end of
+     * operation.  We can optimize by releasing the lock on write buffer as
+     * soon as the operation for same is complete, if it is not same as
+     * primary bucket page, but that doesn't seem to be worth complicating the
+     * code.
+     */
+    if (BufferIsValid(writebuf))
+        UnlockReleaseBuffer(writebuf);
+
+    if (BufferIsValid(bucketbuf))
+        UnlockReleaseBuffer(bucketbuf);

Here in hash_xlog_squeeze_page(), you wait until the very end to
release these locks, but in e.g. hash_xlog_addovflpage you release
them considerably earlier for reasons that seem like they would also
be valid here.  I think you could move this back to before the updates
to the metapage and bitmap page.

+        /*
+         * Note: in normal operation, we'd update the metapage while still
+         * holding lock on the page we inserted into.  But during replay it's
+         * not necessary to hold that lock, since no other index updates can
+         * be happening concurrently.
+         */

This comment in hash_xlog_init_bitmap_page seems to be off-base,
because we didn't just insert into a page.  I'm not disputing that
it's safe, though: not only can nobody else be modifying it because
we're in recovery, but also nobody else can see it yet; the creating
transaction hasn't yet committed.

+    /*
+     * Note that we still update the page even if it was restored from a full
+     * page image, because the bucket flag is not included in the image.
+     */

Really?  Why not?  (Because it's part of the special space?)

Doesn't hash_xlog_split_allocpage need to acquire cleanup locks to
guard against concurrent scans?  And also hash_xlog_split_complete?
And hash_xlog_delete?  If the regular code path is getting a cleanup
lock to protect against concurrent scans, recovery better do it, too.

+    /*
+     * There is no harm in releasing the lock on old bucket before new bucket
+     * during replay as no other bucket splits can be happening concurrently.
+     */
+    if (BufferIsValid(oldbuf))
+        UnlockReleaseBuffer(oldbuf);

And I'm not sure this is safe either.  The issue isn't that there
might be another bucket split happening concurrently, but that there
might be scans that get confused by seeing the bucket split flag set
before the new bucket is created or the metapage updated.  Seems like
it would be safer to keep all the locks for this one.

+     * Initialise the new bucket page, here we can't complete zeroed the page

I think maybe a good way to write these would be "we can't simply zero
the page".  And Initialise -> Initialize.
                /*
+                 * Change the shared buffer state in critical section,
+                 * otherwise any error could make it unrecoverable after
+                 * recovery.
+                 */
+                START_CRIT_SECTION();
+
+                /*
                  * Insert tuple on new page, using _hash_pgaddtup to ensure
                  * correct ordering by hashkey.  This is a tad inefficient
                  * since we may have to shuffle itempointers repeatedly.
                  * Possible future improvement: accumulate all the items for
                  * the new page and qsort them before insertion.
                  */
                 (void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);

+                END_CRIT_SECTION();

No way.  You have to start the critical section before making any page
modifications and keep it alive until all changes have been logged.

hash_redo's mappings of record types to function names is a little
less regular than seems desirable.  Generally, the function name is
the lower-case version of the record type with "hash" and "xlog"
switched.  But XLOG_HASH_ADD_OVFL_PAGE and
XLOG_HASH_SPLIT_ALLOCATE_PAGE break the pattern.

It seems like a good test to do with this patch would be to set up a
pgbench test on the master with a hash index replacing the usual btree
index.  Then, set up a standby and run a read-only pgbench on the
standby while a read-write pgbench test runs on the master.  Maybe
you've already tried something like that?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Thu, Mar 9, 2017 at 3:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Mar 7, 2017 at 6:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>> Great, thanks.  0001 looks good to me now, so committed.
>>>
>>> Committed 0002.
>>
>> Here are some initial review thoughts on 0003 based on a first read-through.
>
> More thoughts on the main patch:
>
> +mode anyway).  It would seem natural to complete the split in VACUUM, but since
> +splitting a bucket might require allocating a new page, it might fail if you
> +run out of disk space.  That would be bad during VACUUM - the reason for
> +running VACUUM in the first place might be that you run out of disk space,
> +and now VACUUM won't finish because you're out of disk space.  In contrast,
> +an insertion can require enlarging the physical file anyway.
>
> But VACUUM actually does try to complete the splits, so this is wrong, I think.
>

Vacuum just tries to clean the remaining tuples from the old bucket.
Only an insert or split operation tries to complete the split.

>
> +                if (!xlrec.is_primary_bucket_page)
> +                    XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
>
> So, in hashbucketcleanup, the page is registered and therefore an FPI
> will be taken if this is the first reference to this page since the
> checkpoint, but hash_xlog_delete won't restore the FPI.  I think that
> exposes us to a torn-page hazard.
> _hash_freeovflpage/hash_xlog_squeeze_page seems to have the same
> disease, as does _hash_squeezebucket/hash_xlog_move_page_contents.
>

Right, if we use XLogReadBufferForRedoExtended() instead of
XLogReadBufferExtended()/LockBufferForCleanup in the replay routine,
then it will restore the FPI when required and can serve our purpose
of acquiring a cleanup lock on the primary bucket page.  Yet another
way could be to store some information like the block number,
relfilenode, and forknum (maybe we can get away with less info) of the
primary bucket in the fixed part of each of these records, and then
use that to read the page and take a cleanup lock on it.  However,
since in most cases this buffer needs to be registered only when there
are multiple overflow pages, I think carrying that fixed information
might hurt in some cases.

>
> +            /*
> +             * As bitmap page doesn't have standard page layout, so this will
> +             * allow us to log the data.
> +             */
> +            XLogRegisterBuffer(2, mapbuf, REGBUF_STANDARD);
> +            XLogRegisterBufData(2, (char *) &bitmap_page_bit, sizeof(uint32));
>
> If the page doesn't have a standard page layout, why are we passing
> REGBUF_STANDARD?  But I think you've hacked things so that bitmap
> pages DO have a standard page layout, per the change to
> _hash_initbitmapbuffer, in which case the comment seems wrong.
>

Yes, the comment is obsolete and I have removed it.  We need a
standard page layout for the bitmap page so that full page writes
won't exclude the bitmap page data.

>
> +    /*
> +     * Note that we still update the page even if it was restored from a full
> +     * page image, because the bucket flag is not included in the image.
> +     */
>
> Really?  Why not?  (Because it's part of the special space?)
>

Yes, because it's part of the special space.  Do you want that to be
mentioned in the comments?

> Doesn't hash_xlog_split_allocpage need to acquire cleanup locks to
> guard against concurrent scans?
>

I don't think so, because the regular code takes it on the old bucket
to protect the split from concurrent inserts, and the cleanup lock on
the new bucket is just for the sake of consistency.

> And also hash_xlog_split_complete?

In regular code, we are not taking cleanup lock for this operation.

> And hash_xlog_delete?  If the regular code path is getting a cleanup
> lock to protect against concurrent scans, recovery better do it, too.
>

We are already taking cleanup lock for this, not sure what exactly is
missed here, can you be more specific?

> +    /*
> +     * There is no harm in releasing the lock on old bucket before new bucket
> +     * during replay as no other bucket splits can be happening concurrently.
> +     */
> +    if (BufferIsValid(oldbuf))
> +        UnlockReleaseBuffer(oldbuf);
>
> And I'm not sure this is safe either.  The issue isn't that there
> might be another bucket split happening concurrently, but that there
> might be scans that get confused by seeing the bucket split flag set
> before the new bucket is created or the metapage updated.
>

During a scan, we check whether the bucket is being populated, so this
should be safe.

>  Seems like
> it would be safer to keep all the locks for this one.
>

If you feel better that way, I can change it and add a comment saying
something like "we could release the lock on the bucket being split
early as well, but we keep it to be consistent with the normal
operation".

>
> It seems like a good test to do with this patch would be to set up a
> pgbench test on the master with a hash index replacing the usual btree
> index.  Then, set up a standby and run a read-only pgbench on the
> standby while a read-write pgbench test runs on the master.  Maybe
> you've already tried something like that?
>

Something similar has been done, but I will do it again with the
final version.



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Thu, Mar 9, 2017 at 12:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Mar 8, 2017 at 7:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Okay, I can try, but note that currently there is no test related to
>> "snapshot too old" for any other indexes.
>
> Wow, that's surprising.  It seems the snapshot_too_old test only
> checks that this works for a table that has no indexes.  Have you,
> anyway, tested it manually?
>

Yes, I have tested it manually.  I think we need to ensure that the
modified tuple falls on the same page as the old tuple to make the
test work.  The slight difficulty with an index is to ensure that the
modified tuple gets inserted into the same page as the old tuple; this
is even more true for hash indexes.  Also, for the heap I think it
relies on the hot pruning stuff, whereas for an index we need to
perform a manual vacuum.  Basically, we can write a test for an index
if we want, but I am not sure it is worth the pain to write one for a
hash index when the test for btree is not there.



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Robert Haas
Date:
On Thu, Mar 9, 2017 at 10:23 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> +mode anyway).  It would seem natural to complete the split in VACUUM, but since
>> +splitting a bucket might require allocating a new page, it might fail if you
>> +run out of disk space.  That would be bad during VACUUM - the reason for
>> +running VACUUM in the first place might be that you run out of disk space,
>> +and now VACUUM won't finish because you're out of disk space.  In contrast,
>> +an insertion can require enlarging the physical file anyway.
>>
>> But VACUUM actually does try to complete the splits, so this is wrong, I think.
>
> Vacuum just tries to clean the remaining tuples from old bucket.  Only
> insert or split operation tries to complete the split.

Oh, right.  Sorry.

>> +                if (!xlrec.is_primary_bucket_page)
>> +                    XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
>>
>> So, in hashbucketcleanup, the page is registered and therefore an FPI
>> will be taken if this is the first reference to this page since the
>> checkpoint, but hash_xlog_delete won't restore the FPI.  I think that
>> exposes us to a torn-page hazard.
>> _hash_freeovflpage/hash_xlog_squeeze_page seems to have the same
>> disease, as does _hash_squeezebucket/hash_xlog_move_page_contents.
>
> Right, if we use XLogReadBufferForRedoExtended() instead of
> XLogReadBufferExtended()/LockBufferForCleanup during relay routine,
> then it will restore FPI when required and can serve our purpose of
> acquiring clean up lock on primary bucket page.  Yet another way could
> be to store some information like block number, relfilenode, forknum
> (maybe we can get away with some less info) of primary bucket in the
> fixed part of each of these records and then using that read the page
> and then take a cleanup lock.   Now, as in most cases, this buffer
> needs to be registered only in cases when there are multiple overflow
> pages, I think having fixed information might hurt in some cases.

Just because the WAL record gets larger?  I think you could arrange to
record the data only in the cases where it is needed, but I'm also not
sure it would matter that much anyway.  Off-hand, it seems better than
having to mark the primary bucket page dirty (because you have to set
the LSN) every time any page in the chain is modified.

>> +    /*
>> +     * Note that we still update the page even if it was restored from a full
>> +     * page image, because the bucket flag is not included in the image.
>> +     */
>>
>> Really?  Why not?  (Because it's part of the special space?)
>
> Yes, because it's part of special space.  Do you want that to
> mentioned in comments?

Seems like a good idea.  Maybe just change the end to "because the
special space is not included in the image" to make it slightly more
explicit.

>> Doesn't hash_xlog_split_allocpage need to acquire cleanup locks to
>> guard against concurrent scans?
>
> I don't think so, because regular code takes it on old bucket to
> protect the split from concurrent inserts and cleanup lock on new
> bucket is just for the sake of consistency.

You might be right, but this definitely does create a state on the
standby that can't occur on the master: the bucket being split is
flagged as such, but the bucket being populated doesn't appear to
exist yet.  Maybe that's OK.  It makes me slightly nervous to have the
standby take a weaker lock than the master does, but perhaps it's
harmless.

>> And also hash_xlog_split_complete?
>
> In regular code, we are not taking cleanup lock for this operation.

You're right, sorry.

>> And hash_xlog_delete?  If the regular code path is getting a cleanup
>> lock to protect against concurrent scans, recovery better do it, too.
>>
>
> We are already taking cleanup lock for this, not sure what exactly is
> missed here, can you be more specific?

Never mind, I'm wrong.

>> +    /*
>> +     * There is no harm in releasing the lock on old bucket before new bucket
>> +     * during replay as no other bucket splits can be happening concurrently.
>> +     */
>> +    if (BufferIsValid(oldbuf))
>> +        UnlockReleaseBuffer(oldbuf);
>>
>> And I'm not sure this is safe either.  The issue isn't that there
>> might be another bucket split happening concurrently, but that there
>> might be scans that get confused by seeing the bucket split flag set
>> before the new bucket is created or the metapage updated.
>
> During scan we check for bucket being populated, so this should be safe.

Yeah, I guess so.  This business of the standby taking weaker locks
still seems scary to me, as in hash_xlog_split_allocpage above.

>>  Seems like
>> it would be safer to keep all the locks for this one.
>
> If you feel better that way, I can change it and add a comment saying
> something like "we could release lock on bucket being split early as
> well, but doing here to be consistent with normal operation"

I think that might not be a bad idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Thu, Mar 9, 2017 at 11:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Mar 9, 2017 at 10:23 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Right, if we use XLogReadBufferForRedoExtended() instead of
>> XLogReadBufferExtended()/LockBufferForCleanup during relay routine,
>> then it will restore FPI when required and can serve our purpose of
>> acquiring clean up lock on primary bucket page.  Yet another way could
>> be to store some information like block number, relfilenode, forknum
>> (maybe we can get away with some less info) of primary bucket in the
>> fixed part of each of these records and then using that read the page
>> and then take a cleanup lock.   Now, as in most cases, this buffer
>> needs to be registered only in cases when there are multiple overflow
>> pages, I think having fixed information might hurt in some cases.
>
> Just because the WAL record gets larger?  I think you could arrange to
> record the data only in the cases where it is needed, but I'm also not
> sure it would matter that much anyway.  Off-hand, it seems better than
> having to mark the primary bucket page dirty (because you have to set
> the LSN) every time any page in the chain is modified.
>

Do we really need to set the LSN on this page (or mark it dirty), and
if so, why?  Are you worried about restoration of the FPI or something
else?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Robert Haas
Date:
On Thu, Mar 9, 2017 at 9:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Do we really need to set LSN on this page (or mark it dirty), if so
> why?  Are you worried about restoration of FPI or something else?

I haven't thought through all of the possible consequences and am a
bit too tired to do so just now, but doesn't it seem rather risky to
invent a whole new way of using these xlog functions?
src/backend/access/transam/README describes how to do write-ahead
logging properly, and neither MarkBufferDirty() nor PageSetLSN() is
described as an optional step.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Fri, Mar 10, 2017 at 8:49 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Mar 9, 2017 at 9:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Do we really need to set LSN on this page (or mark it dirty), if so
>> why?  Are you worried about restoration of FPI or something else?
>
> I haven't thought through all of the possible consequences and am a
> bit to tired to do so just now, but doesn't it seem rather risky to
> invent a whole new way of using these xlog functions?
> src/backend/access/transam/README describes how to do write-ahead
> logging properly, and neither MarkBufferDirty() nor PageSetLSN() is
> described as an optional step.
>

Just to salvage my point, I think this is not the first place where we
register a buffer but don't set the LSN.  For XLOG_HEAP2_VISIBLE, we
register the heap and vm buffers but set the LSN conditionally on the
heap buffer.  Having said that, I see the value of your point and I am
open to doing it that way if you feel that is a better way.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Robert Haas
Date:
On Fri, Mar 10, 2017 at 7:08 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Mar 10, 2017 at 8:49 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Mar 9, 2017 at 9:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> Do we really need to set LSN on this page (or mark it dirty), if so
>>> why?  Are you worried about restoration of FPI or something else?
>>
>> I haven't thought through all of the possible consequences and am a
>> bit to tired to do so just now, but doesn't it seem rather risky to
>> invent a whole new way of using these xlog functions?
>> src/backend/access/transam/README describes how to do write-ahead
>> logging properly, and neither MarkBufferDirty() nor PageSetLSN() is
>> described as an optional step.
>
> Just to salvage my point, I think this is not the first place where we
> register buffer, but don't set lsn.  For XLOG_HEAP2_VISIBLE, we
> register heap and vm buffers but set the LSN conditionally on heap
> buffer.  Having said that, I see the value of your point and I am open
> to doing it that way if you feel that is a better way.

Right, we did that, and it seems to have worked.  But it was a scary
exception that required a lot of thought.  Now, since we did it once,
we could do it again, but I am not sure it is for the best.  In the
case of vacuum, we knew that a vacuum on a large table could otherwise
emit an FPI for every page, which would almost double the amount of
write I/O generated by a vacuum - instead of WAL records + heap pages,
you'd be writing FPIs + heap pages, a big increase.  Now here I think
that the amount of write I/O that it's costing us is not so clear.
Unless it's going to be really big, I'd rather do this in some way we
can think is definitely safe.

Also, if we want to avoid dirtying the primary bucket page by setting
the LSN, IMHO the way to do that is to store the block number of the
primary bucket page without registering the buffer, and then during
recovery, lock that block.  That seems cleaner than hoping that we can
take an FPI without setting the page LSN.
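
A rough sketch of that approach on the redo side (the xlrec field
names here are purely illustrative, not from the patch):

    Buffer  bucketbuf;

    /* read the primary bucket page by its stored block number */
    bucketbuf = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM,
                                       xlrec->prim_bucket_blkno,
                                       RBM_NORMAL);
    if (BufferIsValid(bucketbuf))
        LockBufferForCleanup(bucketbuf);

    /* ... replay the changes to the registered pages ... */

    if (BufferIsValid(bucketbuf))
        UnlockReleaseBuffer(bucketbuf);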

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Fri, Mar 10, 2017 at 6:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Mar 10, 2017 at 7:08 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Fri, Mar 10, 2017 at 8:49 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Thu, Mar 9, 2017 at 9:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>> Do we really need to set LSN on this page (or mark it dirty), if so
>>>> why?  Are you worried about restoration of FPI or something else?
>>>
>>> I haven't thought through all of the possible consequences and am a
>>> bit to tired to do so just now, but doesn't it seem rather risky to
>>> invent a whole new way of using these xlog functions?
>>> src/backend/access/transam/README describes how to do write-ahead
>>> logging properly, and neither MarkBufferDirty() nor PageSetLSN() is
>>> described as an optional step.
>>
>> Just to salvage my point, I think this is not the first place where we
>> register buffer, but don't set lsn.  For XLOG_HEAP2_VISIBLE, we
>> register heap and vm buffers but set the LSN conditionally on heap
>> buffer.  Having said that, I see the value of your point and I am open
>> to doing it that way if you feel that is a better way.
>
> Right, we did that, and it seems to have worked.  But it was a scary
> exception that required a lot of thought.  Now, since we did it once,
> we could do it again, but I am not sure it is for the best.  In the
> case of vacuum, we knew that a vacuum on a large table could otherwise
> emit an FPI for every page, which would almost double the amount of
> write I/O generated by a vacuum - instead of WAL records + heap pages,
> you'd be writing FPIs + heap pages, a big increase.  Now here I think
> that the amount of write I/O that it's costing us is not so clear.
> Unless it's going to be really big, I'd rather do this in some way we
> can think is definitely safe.
>
> Also, if we want to avoid dirtying the primary bucket page by setting
> the LSN, IMHO the way to do that is to store the block number of the
> primary bucket page without registering the buffer, and then during
> recovery, lock that block.  That seems cleaner than hoping that we can
> take an FPI without setting the page LSN.
>

I was thinking that we would use the REGBUF_NO_IMAGE flag, as is used
for the heap buffer in the XLOG_HEAP2_VISIBLE record; that will avoid
any extra I/O and will make it safe as well.  I think that is what
makes registering the buffer without setting the LSN safe for the
XLOG_HEAP2_VISIBLE record.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Robert Haas
Date:
On Fri, Mar 10, 2017 at 8:02 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I was thinking that we will use REGBUF_NO_IMAGE flag as is used in
> XLOG_HEAP2_VISIBLE record for heap buffer, that will avoid any extra
> I/O and will make it safe as well.  I think that makes registering the
> buffer safe without setting LSN for XLOG_HEAP2_VISIBLE record.

Hmm, yeah, that might work, though I would probably want to spend some
time studying the patch and thinking about it before deciding for
sure.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Thu, Mar 9, 2017 at 3:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Mar 7, 2017 at 6:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>> Great, thanks.  0001 looks good to me now, so committed.
>>>
>>> Committed 0002.
>>
>> Here are some initial review thoughts on 0003 based on a first read-through.
>
> More thoughts on the main patch:
>
>                  /*
> +                 * Change the shared buffer state in critical section,
> +                 * otherwise any error could make it unrecoverable after
> +                 * recovery.
> +                 */
> +                START_CRIT_SECTION();
> +
> +                /*
>                   * Insert tuple on new page, using _hash_pgaddtup to ensure
>                   * correct ordering by hashkey.  This is a tad inefficient
>                   * since we may have to shuffle itempointers repeatedly.
>                   * Possible future improvement: accumulate all the items for
>                   * the new page and qsort them before insertion.
>                   */
>                  (void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
>
> +                END_CRIT_SECTION();
>
> No way.  You have to start the critical section before making any page
> modifications and keep it alive until all changes have been logged.
>

I think what we need to do here is to accumulate all the tuples that
need to be added to the new bucket page until either that page has no
more space or there are no more tuples remaining in the old bucket.
Then, in a critical section, add them to the page using
_hash_pgaddmultitup and log the entire new bucket page contents, as is
currently done in the patch's log_split_page().  Now, we could choose
to log the individual tuples instead of the complete page, but I am
not sure there is any benefit in doing so, because
XLogRecordAssemble() will anyway remove the empty space from the page.
Let me know if you have something else in mind.
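
As a sketch of the shape I have in mind (the accumulation loop is
elided, log_split_page() is the patch's helper, and the array names
are illustrative):

    OffsetNumber itup_offsets[MaxIndexTuplesPerPage];
    IndexTuple   itups[MaxIndexTuplesPerPage];
    uint16       nitups = 0;

    /* ... fill itups[] with the tuples bound for the new bucket page ... */

    START_CRIT_SECTION();

    _hash_pgaddmultitup(rel, nbuf, itups, itup_offsets, nitups);
    MarkBufferDirty(nbuf);

    if (RelationNeedsWAL(rel))
        log_split_page(rel, nbuf);  /* logs the whole new bucket page */

    END_CRIT_SECTION();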

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Robert Haas
Date:
On Sat, Mar 11, 2017 at 12:20 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>                  /*
>> +                 * Change the shared buffer state in critical section,
>> +                 * otherwise any error could make it unrecoverable after
>> +                 * recovery.
>> +                 */
>> +                START_CRIT_SECTION();
>> +
>> +                /*
>>                   * Insert tuple on new page, using _hash_pgaddtup to ensure
>>                   * correct ordering by hashkey.  This is a tad inefficient
>>                   * since we may have to shuffle itempointers repeatedly.
>>                   * Possible future improvement: accumulate all the items for
>>                   * the new page and qsort them before insertion.
>>                   */
>>                  (void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
>>
>> +                END_CRIT_SECTION();
>>
>> No way.  You have to start the critical section before making any page
>> modifications and keep it alive until all changes have been logged.
>>
>
> I think what we need to do here is to accumulate all the tuples that
> need to be added to new bucket page till either that page has no more
> space or there are no more tuples remaining in an old bucket.  Then in
> a critical section, add them to the page using _hash_pgaddmultitup and
> log the entire new bucket page contents as is currently done in patch
> log_split_page().

I agree.

> Now, here we can choose to log the individual
> tuples as well instead of a complete page, however not sure if there
> is any benefit for doing the same because XLogRecordAssemble() will
> anyway remove the empty space from the page.  Let me know if you have
> something else in mind.

Well, if you have two pages that are 75% full, and you move a third of
the tuples from one of them into the other, it's going to be about
four times more efficient to log only the moved tuples than the whole
page.  But logging the whole filled page wouldn't be wrong, just
somewhat inefficient.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Thu, Mar 9, 2017 at 3:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Mar 7, 2017 at 6:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>> Great, thanks.  0001 looks good to me now, so committed.
>>>
>>> Committed 0002.
>>
>> Here are some initial review thoughts on 0003 based on a first read-through.
>
> More thoughts on the main patch:
>
> The text you've added to the hash index README throws around the term
> "single WAL entry" pretty freely.  For example: "An insertion that
> causes an addition of an overflow page is logged as a single WAL entry
> preceded by a WAL entry for new overflow page required to insert a
> tuple."  When you say a single WAL entry, you make it sound like
> there's only one, but because it's preceded by another WAL entry,
> there are actually two.  I would rephase this to just avoid using the
> word "single" this way.  For example: "If an insertion causes the
> addition of an overflow page, there will be one WAL entry for the new
> overflow page and a second entry for the insert itself."
>

Okay, changed as per suggestion.  I have also removed the second part
of the above comment, which states that a newly allocated overflow
page can be used by concurrent transactions.  That won't be true after
fixing the release/reacquire stuff in _hash_addovflpage.  I have also
rephrased another similar usage of "single WAL entry".

> +mode anyway).  It would seem natural to complete the split in VACUUM, but since
> +splitting a bucket might require allocating a new page, it might fail if you
> +run out of disk space.  That would be bad during VACUUM - the reason for
> +running VACUUM in the first place might be that you run out of disk space,
> +and now VACUUM won't finish because you're out of disk space.  In contrast,
> +an insertion can require enlarging the physical file anyway.
>
> But VACUUM actually does try to complete the splits, so this is wrong, I think.
>

As discussed up thread, there is no need to change.

> +Squeeze operation moves tuples from one of the buckets later in the chain to
>
> A squeeze operation
>
> +As Squeeze operation involves writing multiple atomic operations, it is
>
> As a squeeze operation involves multiple atomic operations
>

Changed as per suggestion.

> +                if (!xlrec.is_primary_bucket_page)
> +                    XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
>
> So, in hashbucketcleanup, the page is registered and therefore an FPI
> will be taken if this is the first reference to this page since the
> checkpoint, but hash_xlog_delete won't restore the FPI.  I think that
> exposes us to a torn-page hazard.
> _hash_freeovflpage/hash_xlog_squeeze_page seems to have the same
> disease, as does _hash_squeezebucket/hash_xlog_move_page_contents.
> Not sure exactly what the fix should be.
>

I have used REGBUF_NO_IMAGE while registering the buffer to avoid this
hazard, and during replay used XLogReadBufferForRedoExtended() to read
the block and take a cleanup lock on it.
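
Roughly, the two sides look like this (a sketch only; the block id and
variable names are illustrative rather than copied from the patch):

    /* DO side: register the primary bucket page without forcing an FPI */
    if (!xlrec.is_primary_bucket_page)
        XLogRegisterBuffer(0, bucket_buf,
                           REGBUF_STANDARD | REGBUF_NO_IMAGE);

    /* REDO side: pin the same block and take a cleanup lock on it */
    (void) XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL,
                                         true,  /* get_cleanup_lock */
                                         &bucketbuf);

    /* ... replay the changes to the other registered pages ... */

    if (BufferIsValid(bucketbuf))
        UnlockReleaseBuffer(bucketbuf);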

> +            /*
> +             * As bitmap page doesn't have standard page layout, so this will
> +             * allow us to log the data.
> +             */
> +            XLogRegisterBuffer(2, mapbuf, REGBUF_STANDARD);
> +            XLogRegisterBufData(2, (char *) &bitmap_page_bit, sizeof(uint32));
>
> If the page doesn't have a standard page layout, why are we passing
> REGBUF_STANDARD?  But I think you've hacked things so that bitmap
> pages DO have a standard page layout, per the change to
> _hash_initbitmapbuffer, in which case the comment seems wrong.
>

Yes, the comment is obsolete and I have removed it.  We need a
standard page layout for the bitmap page so that full page writes
won't exclude the bitmap page data.

>
> +        recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_ADD_OVFL_PAGE);
> +
> +        PageSetLSN(BufferGetPage(ovflbuf), recptr);
> +        PageSetLSN(BufferGetPage(buf), recptr);
> +
> +        if (BufferIsValid(mapbuf))
> +            PageSetLSN(BufferGetPage(mapbuf), recptr);
> +
> +        if (BufferIsValid(newmapbuf))
> +            PageSetLSN(BufferGetPage(newmapbuf), recptr);
>
> I think you forgot to call PageSetLSN() on metabuf.  (There are 5
> block references in the write-ahead log record but only 4 calls to
> PageSetLSN ... surely that can't be right!)
>

Right, fixed.

> +    /*
> +     * We need to release the locks once the prev pointer of overflow bucket
> +     * and next of left bucket are set, otherwise concurrent read might skip
> +     * the bucket.
> +     */
> +    if (BufferIsValid(leftbuf))
> +        UnlockReleaseBuffer(leftbuf);
> +    UnlockReleaseBuffer(ovflbuf);
>
> I don't believe the comment.  Postponing the lock release wouldn't
> cause the concurrent read to skip the bucket.  It might cause it to be
> delayed unnecessarily, but the following comment already addresses
> that point.  I think you can just nuke this comment.
>

Okay, removed the comment.

> +        xlrec.is_prim_bucket_same_wrt = (wbuf == bucketbuf) ? true : false;
> +        xlrec.is_prev_bucket_same_wrt = (wbuf == prevbuf) ? true : false;
>
> Maybe just xlrec.is_prim_bucket_same_wrt = (wbuf == bucketbuf); and similar?
>

Changed as per suggestion.

> +    /*
> +     * We release locks on writebuf and bucketbuf at end of replay operation
> +     * to ensure that we hold lock on primary bucket page till end of
> +     * operation.  We can optimize by releasing the lock on write buffer as
> +     * soon as the operation for same is complete, if it is not same as
> +     * primary bucket page, but that doesn't seem to be worth complicating the
> +     * code.
> +     */
> +    if (BufferIsValid(writebuf))
> +        UnlockReleaseBuffer(writebuf);
> +
> +    if (BufferIsValid(bucketbuf))
> +        UnlockReleaseBuffer(bucketbuf);
>
> Here in hash_xlog_squeeze_page(), you wait until the very end to
> release these locks, but in e.g. hash_xlog_addovflpage you release
> them considerably earlier for reasons that seem like they would also
> be valid here.  I think you could move this back to before the updates
> to the metapage and bitmap page.
>

Sure, I don't see any problem with releasing the locks early, so
changed as per suggestion.

> +        /*
> +         * Note: in normal operation, we'd update the metapage while still
> +         * holding lock on the page we inserted into.  But during replay it's
> +         * not necessary to hold that lock, since no other index updates can
> +         * be happening concurrently.
> +         */
>
> This comment in hash_xlog_init_bitmap_page seems to be off-base,
> because we didn't just insert into a page.  I'm not disputing that
> it's safe, though: not only can nobody else be modifying it because
> we're in recovery, but also nobody else can see it yet; the creating
> transaction hasn't yet committed.
>

Fixed the comment and mentioned the second reason as suggested by you
(that sounds appropriate here).

> +    /*
> +     * Note that we still update the page even if it was restored from a full
> +     * page image, because the bucket flag is not included in the image.
> +     */
>
> Really?  Why not?  (Because it's part of the special space?)
>

Added a comment as discussed above.

> Doesn't hash_xlog_split_allocpage need to acquire cleanup locks to
> guard against concurrent scans?

Changed and made it similar to the normal operation by taking a
cleanup lock.

> +    /*
> +     * There is no harm in releasing the lock on old bucket before new bucket
> +     * during replay as no other bucket splits can be happening concurrently.
> +     */
> +    if (BufferIsValid(oldbuf))
> +        UnlockReleaseBuffer(oldbuf);
>
> And I'm not sure this is safe either.  The issue isn't that there
> might be another bucket split happening concurrently, but that there
> might be scans that get confused by seeing the bucket split flag set
> before the new bucket is created or the metapage updated.  Seems like
> it would be safer to keep all the locks for this one.
>

Changed as discussed.

> +     * Initialise the new bucket page, here we can't complete zeroed the page
>
> I think maybe a good way to write these would be "we can't simply zero
> the page".  And Initialise -> Initialize.
>

I have changed the comment; see if that looks okay to you.

>                  /*
> +                 * Change the shared buffer state in critical section,
> +                 * otherwise any error could make it unrecoverable after
> +                 * recovery.
> +                 */
> +                START_CRIT_SECTION();
> +
> +                /*
>                   * Insert tuple on new page, using _hash_pgaddtup to ensure
>                   * correct ordering by hashkey.  This is a tad inefficient
>                   * since we may have to shuffle itempointers repeatedly.
>                   * Possible future improvement: accumulate all the items for
>                   * the new page and qsort them before insertion.
>                   */
>                  (void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
>
> +                END_CRIT_SECTION();
>
> No way.  You have to start the critical section before making any page
> modifications and keep it alive until all changes have been logged.
>

Changed as discussed up thread.

> hash_redo's mappings of record types to function names is a little
> less regular than seems desirable.  Generally, the function name is
> the lower-case version of the record type with "hash" and "xlog"
> switched.  But XLOG_HASH_ADD_OVFL_PAGE and
> XLOG_HASH_SPLIT_ALLOCATE_PAGE break the pattern.
>

Okay, changed as per suggestion, and I went ahead and changed the
names of the corresponding xlog record structures as well, because
those also follow the function name pattern.

> It seems like a good test to do with this patch would be to set up a
> pgbench test on the master with a hash index replacing the usual btree
> index.  Then, set up a standby and run a read-only pgbench on the
> standby while a read-write pgbench test runs on the master.  Maybe
> you've already tried something like that?
>

I also think so, and apart from that I think it makes sense to perform
recovery tests with Jeff Janes' tool and probably tests with
wal_consistency_check.  These tests have already been running for the
past seven hours or so, and I will keep them running for the whole
night to see if there is any discrepancy.  Thanks to Kuntal and
Ashutosh for helping me in doing these stress tests.

Note that I have fixed all the other comments raised by you in another
e-mail, including adding the test for snapshot too old (I have
prepared a separate patch for the test; I have added just one test to
keep the timing short, because each test takes a minimum of 6s to
complete, which is the same requirement as for the existing snapshot
too old tests).  Let me know if I have missed anything.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Sun, Mar 12, 2017 at 8:06 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Mar 11, 2017 at 12:20 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>                  /*
>>> +                 * Change the shared buffer state in critical section,
>>> +                 * otherwise any error could make it unrecoverable after
>>> +                 * recovery.
>>> +                 */
>>> +                START_CRIT_SECTION();
>>> +
>>> +                /*
>>>                   * Insert tuple on new page, using _hash_pgaddtup to ensure
>>>                   * correct ordering by hashkey.  This is a tad inefficient
>>>                   * since we may have to shuffle itempointers repeatedly.
>>>                   * Possible future improvement: accumulate all the items for
>>>                   * the new page and qsort them before insertion.
>>>                   */
>>>                  (void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
>>>
>>> +                END_CRIT_SECTION();
>>>
>>> No way.  You have to start the critical section before making any page
>>> modifications and keep it alive until all changes have been logged.
>>>
>>
>> I think what we need to do here is to accumulate all the tuples that
>> need to be added to new bucket page till either that page has no more
>> space or there are no more tuples remaining in an old bucket.  Then in
>> a critical section, add them to the page using _hash_pgaddmultitup and
>> log the entire new bucket page contents as is currently done in patch
>> log_split_page().
>
> I agree.
>

Okay, I have changed it like that in the patch.

>> Now, here we can choose to log the individual
>> tuples as well instead of a complete page, however not sure if there
>> is any benefit for doing the same because XLogRecordAssemble() will
>> anyway remove the empty space from the page.  Let me know if you have
>> something else in mind.
>
> Well, if you have two pages that are 75% full, and you move a third of
> the tuples from one of them into the other, it's going to be about
> four times more efficient to log only the moved tuples than the whole
> page.
>

I don't see how this could happen during a split.  I mean, if you are
moving 25% of the tuples from the old bucket page to a new bucket page
which is 75% full, it will log the new bucket page only after it gets
full.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Mon, Mar 13, 2017 at 6:26 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Mar 9, 2017 at 3:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Mar 7, 2017 at 6:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>>> Great, thanks.  0001 looks good to me now, so committed.
>>>>
>>>> Committed 0002.
>>>
>>> Here are some initial review thoughts on 0003 based on a first read-through.
>>
>
>> It seems like a good test to do with this patch would be to set up a
>> pgbench test on the master with a hash index replacing the usual btree
>> index.  Then, set up a standby and run a read-only pgbench on the
>> standby while a read-write pgbench test runs on the master.  Maybe
>> you've already tried something like that?
>>
>
> I also think so, and apart from that I think it makes sense to perform
> a recovery test with Jeff Janes's tool and probably tests with
> wal_consistency_checking. These tests have already been running for the
> past seven hours or so, and I will keep them running for the whole night
> to see if there is any discrepancy.
>

We didn't find any issues with the above testing.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Robert Haas
Date:
On Mon, Mar 13, 2017 at 11:48 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> We didn't find any issues with the above testing.

Great!  I've committed the latest version of the patch, with some
cosmetic changes.

It would be astonishing if there weren't a bug or two left, but I
think overall this is very solid work, and I think it's time to put
this out there and see how things go.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Great!  I've committed the latest version of the patch, with some
> cosmetic changes.

Woo hoo!  That's been a bee in the bonnet for, um, decades.
        regards, tom lane



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Robert Haas
Date:
On Tue, Mar 14, 2017 at 1:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Great!  I've committed the latest version of the patch, with some
>> cosmetic changes.
>
> Woo hoo!  That's been a bee in the bonnet for, um, decades.

Yeah.  I'm pretty happy to see that go in.

It's become pretty clear to me that there are a bunch of other things
about hash indexes which are not exactly great, the worst of which is
the way they grow by DOUBLING IN SIZE.  Mithun has submitted a patch
which - if it's acceptable - would alleviate that to some degree by
splitting up each doubling into four separate increments.  But that
still could lead to very large growth increments on big indexes, like
your 32GB index suddenly growing - in one fell swoop - to 40GB.
That's admittedly a lot better than what will happen right now, which
is instantaneous growth to 64GB, but it's not great, and it's not
altogether clear how it could be fixed in a really satisfying way.

Other things that are not so great:

- no multi-column support
- no amcanunique support
- every insert dirties the metapage
- splitting is generally too aggressive; very few overflow pages are
ever created unless you have piles of duplicates

Still, this is a big step forward.  I think it will be useful for
people with long text strings or other very wide data that they want
to index, and hopefully we'll get some feedback from users about where
this works well and where it doesn't.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> It's become pretty clear to me that there are a bunch of other things
> about hash indexes which are not exactly great, the worst of which is
> the way they grow by DOUBLING IN SIZE.

Uh, what?  Growth should happen one bucket-split at a time.

> Other things that are not so great:

> - no multi-column support
> - no amcanunique support
> - every insert dirties the metapage
> - splitting is generally too aggressive; very few overflow pages are
> ever created unless you have piles of duplicates

Yeah.  It's a bit hard to see how to add multi-column support unless you
give up the property of allowing queries on a subset of the index columns.
Lack of amcanunique seems like mostly a round-tuit shortage.  The other
two are implementation deficiencies that maybe we can remedy someday.

Another thing I'd like to see is support for 64-bit hash values.

But all of these were mainly blocked by people not wanting to sink effort
into hash indexes as long as they were unusable for production due to lack
of WAL support.  So this is a huge step forward.
        regards, tom lane



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Robert Haas
Date:
On Tue, Mar 14, 2017 at 2:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> It's become pretty clear to me that there are a bunch of other things
>> about hash indexes which are not exactly great, the worst of which is
>> the way they grow by DOUBLING IN SIZE.
>
> Uh, what?  Growth should happen one bucket-split at a time.

Technically, the buckets are created one at a time, but because of the
way hashm_spares works, the primary bucket pages for all buckets from
2^N to 2^{N+1}-1 must be physically consecutive.  See
_hash_alloc_buckets.
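
Roughly, the mapping that forces this layout looks like the following (a
simplified sketch of what the BUCKET_TO_BLKNO arithmetic computes; the
names come from the hash AM, but this is not the exact code):

/*
 * Simplified sketch of the bucket-to-block mapping.  hashm_spares[]
 * counts overflow pages allocated before each splitpoint, so the primary
 * page of bucket B can be found by simple arithmetic.  That only works
 * if the primary pages of buckets 2^N .. 2^{N+1}-1 are laid out
 * consecutively when that splitpoint is created.
 */
static BlockNumber
bucket_to_blkno_sketch(HashMetaPage metap, Bucket bucket)
{
    uint32      splitpoint = _hash_log2(bucket + 1);    /* which doubling */
    uint32      spares = (bucket > 0) ?
        metap->hashm_spares[splitpoint - 1] : 0;        /* earlier overflow pages */

    return (BlockNumber) (bucket + spares + 1);          /* +1 for the metapage */
}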

>> Other things that are not so great:
>
>> - no multi-column support
>> - no amcanunique support
>> - every insert dirties the metapage
>> - splitting is generally too aggressive; very few overflow pages are
>> ever created unless you have piles of duplicates
>
> Yeah.  It's a bit hard to see how to add multi-column support unless you
> give up the property of allowing queries on a subset of the index columns.
> Lack of amcanunique seems like mostly a round-tuit shortage.  The other
> two are implementation deficiencies that maybe we can remedy someday.
>
> Another thing I'd like to see is support for 64-bit hash values.
>
> But all of these were mainly blocked by people not wanting to sink effort
> into hash indexes as long as they were unusable for production due to lack
> of WAL support.  So this is a huge step forward.

Agreed, on all points.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Mar 14, 2017 at 2:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> It's become pretty clear to me that there are a bunch of other things
>>> about hash indexes which are not exactly great, the worst of which is
>>> the way they grow by DOUBLING IN SIZE.

>> Uh, what?  Growth should happen one bucket-split at a time.

> Technically, the buckets are created one at a time, but because of the
> way hashm_spares works, the primary bucket pages for all buckets from
> 2^N to 2^{N+1}-1 must be physically consecutive.  See
> _hash_alloc_buckets.

Right, but we only fill those pages one at a time.

It's true that as soon as we need another overflow page, that's going to
get dropped beyond the 2^{N+1}-1 point, and the *apparent* size of the
index will grow quite a lot.  But any modern filesystem should handle
that without much difficulty by treating the index as a sparse file.

There may be some work to be done in places like pg_basebackup to
recognize and deal with sparse files, but it doesn't seem like a
reason to panic.
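
For what it's worth, just recognizing a sparse segment is cheap;
something along these lines (a sketch, not anything pg_basebackup
currently does) would flag candidates on POSIX systems:

#include <sys/types.h>
#include <sys/stat.h>
#include <stdbool.h>

/*
 * Return true if the file appears sparse: the allocated blocks
 * (st_blocks, in 512-byte units per POSIX) cover less than the
 * apparent length.
 */
static bool
file_looks_sparse(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return false;           /* let the caller deal with errors */

    return (off_t) st.st_blocks * 512 < st.st_size;
}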
        regards, tom lane



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Stephen Frost
Date:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Tue, Mar 14, 2017 at 2:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Robert Haas <robertmhaas@gmail.com> writes:
> >>> It's become pretty clear to me that there are a bunch of other things
> >>> about hash indexes which are not exactly great, the worst of which is
> >>> the way they grow by DOUBLING IN SIZE.
>
> >> Uh, what?  Growth should happen one bucket-split at a time.
>
> > Technically, the buckets are created one at a time, but because of the
> > way hashm_spares works, the primary bucket pages for all buckets from
> > 2^N to 2^{N+1}-1 must be physically consecutive.  See
> > _hash_alloc_buckets.
>
> Right, but we only fill those pages one at a time.
>
> It's true that as soon as we need another overflow page, that's going to
> get dropped beyond the 2^{N+1}-1 point, and the *apparent* size of the
> index will grow quite a lot.  But any modern filesystem should handle
> that without much difficulty by treating the index as a sparse file.

Uh, last I heard we didn't allow or want sparse files in the backend
because then we have to handle a possible out-of-disk-space failure on
every write.

If we think they're OK to do, it'd be awfully nice to figure out a way
for VACUUM to turn an entirely-empty 1G chunk into a sparse file...

> There may be some work to be done in places like pg_basebackup to
> recognize and deal with sparse files, but it doesn't seem like a
> reason to panic.

Well, and every file-based backup tool out there..

Thanks!

Stephen

Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Tom Lane
Date:
Stephen Frost <sfrost@snowman.net> writes:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> It's true that as soon as we need another overflow page, that's going to
>> get dropped beyond the 2^{N+1}-1 point, and the *apparent* size of the
>> index will grow quite a lot.  But any modern filesystem should handle
>> that without much difficulty by treating the index as a sparse file.

> Uh, last I heard we didn't allow or want sparse files in the backend
> because then we have to handle a possible out-of-disk-space failure on
> every write.

For a hash index, this would happen during a bucket split, which would
need to be resilient against out-of-disk-space anyway.

>> There may be some work to be done in places like pg_basebackup to
>> recognize and deal with sparse files, but it doesn't seem like a
>> reason to panic.

> Well, and every file-based backup tool out there..

Weren't you the one leading the charge to deprecate use of file-based
backup?
        regards, tom lane



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Stephen Frost
Date:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > * Tom Lane (tgl@sss.pgh.pa.us) wrote:
> >> It's true that as soon as we need another overflow page, that's going to
> >> get dropped beyond the 2^{N+1}-1 point, and the *apparent* size of the
> >> index will grow quite a lot.  But any modern filesystem should handle
> >> that without much difficulty by treating the index as a sparse file.
>
> > Uh, last I heard we didn't allow or want sparse files in the backend
> > because then we have to handle a possible out-of-disk-space failure on
> > every write.
>
> For a hash index, this would happen during a bucket split, which would
> need to be resilient against out-of-disk-space anyway.

We wouldn't attempt to use the area of the file which is not yet
allocated except when doing a bucket split?  If that's the case then
this does seem to at least be less of an issue, though I hope we put in
appropriate comments about it.

> >> There may be some work to be done in places like pg_basebackup to
> >> recognize and deal with sparse files, but it doesn't seem like a
> >> reason to panic.
>
> > Well, and every file-based backup tool out there..
>
> Weren't you the one leading the charge to deprecate use of file-based
> backup?

No, nor do I see how we would ever be able to deprecate file-based
backups.  If anything, I'd like to see us improve our support for
them.  I'm certainly curious where the notion that I was ever in favor
of deprecating them came from, particularly given all of the effort that
David and I have been pouring into our favorite file-based backup tool
over the past few years.

Thanks!

Stephen

Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Mark Kirkwood
Date:
On 15/03/17 06:29, Robert Haas wrote:

>
> Great!  I've committed the latest version of the patch, with some
> cosmetic changes.
>
> It would be astonishing if there weren't a bug or two left, but I
> think overall this is very solid work, and I think it's time to put
> this out there and see how things go.
>

Awesome, great work!

Cheers

Mark



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Tue, Mar 14, 2017 at 10:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Mar 13, 2017 at 11:48 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> We didn't find any issues with the above testing.
>
> Great!  I've committed the latest version of the patch, with some
> cosmetic changes.
>

Thanks a lot.  I have noticed that the test case patch for "snapshot
too old" is not committed.  Do you want to leave that out due to the
timing requirement, or do you want me to submit it separately so that we
can deal with it on its own and perhaps get input from Kevin?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Wed, Mar 15, 2017 at 12:53 AM, Stephen Frost <sfrost@snowman.net> wrote:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> Stephen Frost <sfrost@snowman.net> writes:
>> > * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> >> It's true that as soon as we need another overflow page, that's going to
>> >> get dropped beyond the 2^{N+1}-1 point, and the *apparent* size of the
>> >> index will grow quite a lot.  But any modern filesystem should handle
>> >> that without much difficulty by treating the index as a sparse file.
>>
>> > Uh, last I heard we didn't allow or want sparse files in the backend
>> > because then we have to handle a possible out-of-disk-space failure on
>> > every write.
>>
>> For a hash index, this would happen during a bucket split, which would
>> need to be resilient against out-of-disk-space anyway.
>
> We wouldn't attempt to use the area of the file which is not yet
> allocated except when doing a bucket split?
>

That's right.

>  If that's the case then
> this does seem to at least be less of an issue, though I hope we put in
> appropriate comments about it.
>

I think we have sufficient comments in the code, especially on top of
the function _hash_alloc_buckets().


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Stephen Frost
Date:
Amit,

* Amit Kapila (amit.kapila16@gmail.com) wrote:
> On Wed, Mar 15, 2017 at 12:53 AM, Stephen Frost <sfrost@snowman.net> wrote:
> >  If that's the case then
> > this does seem to at least be less of an issue, though I hope we put in
> > appropriate comments about it.
>
> I think we have sufficient comments in the code, especially on top of
> the function _hash_alloc_buckets().

I don't see any comments regarding how we have to be sure to handle
an out-of-space case properly in the middle of a file because we've made
it sparse.

I do see that mdwrite() should handle an out-of-disk-space case, though
that just makes me wonder what's different here compared to normal
relations, such that we don't have an issue with a sparse WAL'd hash
index but we can't handle it if a normal relation is sparse.

Additional comments here would be overkill if we think that the lower
layers are already all set up to handle such a case properly and we
don't have to do anything special, but I had understood that we didn't
generally think that sparse files would just work.  I'm certainly happy
to be wrong there because, if that's the case, it'd be a great way to
improve the problems we have returning large chunks of free space in the
middle of a relation to the OS, but if there is a real concern here then
we need to work out what it is and probably add comments or possibly add
code to address whatever it is.

Thanks!

Stephen

Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Robert Haas
Date:
On Wed, Mar 15, 2017 at 9:18 AM, Stephen Frost <sfrost@snowman.net> wrote:
>> I think we have sufficient comments in the code, especially on top of
>> the function _hash_alloc_buckets().
>
> I don't see any comments regarding how we have to be sure to handle
> an out-of-space case properly in the middle of a file because we've made
> it sparse.
>
> I do see that mdwrite() should handle an out-of-disk-space case, though
> that just makes me wonder what's different here compared to normal
> relations, such that we don't have an issue with a sparse WAL'd hash
> index but we can't handle it if a normal relation is sparse.

I agree.  I think that what hash indexes are doing here is
inconsistent with what we do for btree indexes and the heap.  And I
don't think it would be bad to fix that.  We could, of course, go the
other way and do what Tom is suggesting - insist that everybody's got
to be prepared for sparse files, but I would view that as something of
a U-turn.  I think hash indexes are like this because nobody's really
worried about hash indexes because they haven't been crash-safe or
performant.  Now that we've made them crash safe and are on the way to
making them performant, fixing other things is, as Tom already said, a
lot more interesting.

Now, that having been said, I'm not sure it's a good idea to tinker
with the behavior for v10.  We could change the new-splitpoint code so
that it loops over all the pages in the new splitpoint and zeroes them
all, instead of just the last one.  If we all agree that's how it
should work, then it's probably not a lot of work to make the change.
But if what's needed is anything more than that or we don't all agree,
then we'd better just leave it alone and we can revisit it for v11.
It's too late to start making significant or controversial design
changes at this point, and you could argue that this is a preexisting
design defect which the WAL-logging patch wasn't necessarily obligated
to fix.
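
For clarity, that minimal change would look roughly like the following,
in place of the single extension write (a sketch only; firstblock and
lastblock stand in for the splitpoint boundaries, and the page
initialization done by the real _hash_alloc_buckets is omitted):

/*
 * Sketch: write a zeroed page for every block in the new splitpoint so
 * the filesystem allocates real storage up front, instead of extending
 * only at the final block and leaving a hole.
 */
char        zerobuf[BLCKSZ];
BlockNumber blkno;

memset(zerobuf, 0, sizeof(zerobuf));

RelationOpenSmgr(rel);
for (blkno = firstblock; blkno <= lastblock; blkno++)
    smgrextend(rel->rd_smgr, MAIN_FORKNUM, blkno, zerobuf, false);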

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Now, that having been said, I'm not sure it's a good idea to tinker
> with the behavior for v10.  We could change the new-splitpoint code so
> that it loops over all the pages in the new splitpoint and zeroes them
> all, instead of just the last one.

Why would we do that?  That would change the behavior from something
that's arguably OK (at least given the right filesystem) to something
that's clearly not very OK.

> It's too late to start making significant or controversial design
> changes at this point, and you could argue that this is a preexisting
> design defect which the WAL-logging patch wasn't necessarily obligated
> to fix.

Yes, and it is certainly that, and no this patch wasn't chartered to fix
it.  I don't have a problem with leaving things like this for v10.

FWIW, I'm not certain that Stephen is correct to claim that we have
some concrete problem with sparse files.  We certainly don't *depend*
on sparse storage anyplace else, nor write data in a way that would be
likely to trigger it; but I'm not aware that we need to work hard to
avoid it.
        regards, tom lane



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Tom Lane
Date:
Stephen Frost <sfrost@snowman.net> writes:
> I do see that mdwrite() should handle an out-of-disk-space case, though
> that just makes me wonder what's different here compared to normal
> relations, such that we don't have an issue with a sparse WAL'd hash
> index but we can't handle it if a normal relation is sparse.

*Any* write has to be prepared to handle errors.  There's always a risk of
EIO, and on a COW filesystem you might well get ENOSPC even when you think
you're overwriting previously-allocated storage.  All that we are doing by
pre-allocating storage is reducing the risks a bit, not guaranteeing that
no error will happen.
        regards, tom lane



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Stephen Frost
Date:
Tom,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> FWIW, I'm not certain that Stephen is correct to claim that we have
> some concrete problem with sparse files.  We certainly don't *depend*
> on sparse storage anyplace else, nor write data in a way that would be
> likely to trigger it; but I'm not aware that we need to work hard to
> avoid it.

I don't have any concrete evidence that there is an issue, just a
recollection of bringing up, in a discussion somewhere along the way,
exactly this idea of turning entirely empty 1G segments into sparse
files during VACUUM to free that space back to the filesystem, and being
told that it'd be bad because we would run into issues later when we try
to write those pages back out and discover we're out of disk space.

Naturally, there's a lot to consider with such a change to VACUUM, like
how we would WAL-log it and how we'd make sure that nothing is about to
try to use any of those pages (perhaps start at the end of the segment
and lock the pages, or just try to get an exclusive lock on the table,
as we do when we truncate the relation because there's free space at the
end?), but it'd be a very interesting project to consider, if we are
willing to accept sparse heap files.

Another interesting idea might be to somehow change the now-empty
relation into just a place-holder that says "assume I'm 1G in size and
empty", but actually have the file be shrunk.

Anyway, I tend to agree with the sentiment that we don't need this patch
to change the behavior here, and perhaps it'll be good to see what
happens when people start using these sparse hash indexes; maybe that
will shed some light on whether we have anything to worry about here,
and if not then we can consider having sparse files elsewhere.

Thanks!

Stephen

Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Robert Haas
Date:
On Wed, Mar 15, 2017 at 10:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> FWIW, I'm not certain that Stephen is correct to claim that we have
> some concrete problem with sparse files.  We certainly don't *depend*
> on sparse storage anyplace else, nor write data in a way that would be
> likely to trigger it; but I'm not aware that we need to work hard to
> avoid it.

That theory seems inconsistent with how mdextend() works.  My
understanding is that we zero-fill the new blocks before populating
them with actual data precisely to avoid running out of disk space due
to deferred allocation at the OS level.  If we don't care about
failures due to deferred allocation at the OS level, we can rip that
logic out and improve the performance of relation extension
considerably.  If we do care about failures due to deferred
allocation, then leaving holes in the file is a bad idea.
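
For reference, the principle mdextend() relies on, expressed in generic
POSIX terms (this is a sketch of the idea, not the actual md.c code):

#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>

/*
 * Write a zero-filled block at the new offset so the filesystem
 * allocates storage now; a later overwrite of that block should then
 * not fail with ENOSPC (copy-on-write filesystems excepted, as noted
 * elsewhere in this thread).
 */
static int
extend_with_zeros(int fd, off_t blockno, size_t blcksz)
{
    char       *zerobuf = calloc(1, blcksz);
    ssize_t     written;

    if (zerobuf == NULL)
        return -1;

    written = pwrite(fd, zerobuf, blcksz, blockno * (off_t) blcksz);
    free(zerobuf);

    return (written == (ssize_t) blcksz) ? 0 : -1;
}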

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Robert Haas
Date:
On Tue, Mar 14, 2017 at 10:30 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Mar 14, 2017 at 10:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Mon, Mar 13, 2017 at 11:48 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> We didn't find any issues with the above testing.
>>
>> Great!  I've committed the latest version of the patch, with some
>> cosmetic changes.
>>
>
> Thanks a lot.  I have noticed that the test case patch for "snapshot
> too old" is not committed.  Do you want to leave that out due to the
> timing requirement, or do you want me to submit it separately so that we
> can deal with it on its own and perhaps get input from Kevin?

I've committed that now as well.  Let's see what the buildfarm thinks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Stephen Frost
Date:
Robert,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Wed, Mar 15, 2017 at 10:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > FWIW, I'm not certain that Stephen is correct to claim that we have
> > some concrete problem with sparse files.  We certainly don't *depend*
> > on sparse storage anyplace else, nor write data in a way that would be
> > likely to trigger it; but I'm not aware that we need to work hard to
> > avoid it.
>
> That theory seems inconsistent with how mdextend() works.  My
> understanding is that we zero-fill the new blocks before populating
> them with actual data precisely to avoid running out of disk space due
> to deferred allocation at the OS level.  If we don't care about
> failures due to deferred allocation at the OS level, we can rip that
> logic out and improve the performance of relation extension
> considerably.  If we do care about failures due to deferred
> allocation, then leaving holes in the file is a bad idea.

That is a fantastic point.

Thanks!

Stephen

Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> That theory seems inconsistent with how mdextend() works.  My
> understanding is that we zero-fill the new blocks before populating
> them with actual data precisely to avoid running out of disk space due
> to deferred allocation at the OS level.  If we don't care about
> failures due to deferred allocation at the OS level, we can rip that
> logic out and improve the performance of relation extension
> considerably.

See my reply to Stephen.  The fact that this fails to guarantee no
ENOSPC on COW filesystems doesn't mean that it's not worth doing on
other filesystems.  We're reducing the risk, not eliminating it,
but reducing risk is still a worthwhile activity.
        regards, tom lane



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Stephen Frost
Date:
Tom,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > That theory seems inconsistent with how mdextend() works.  My
> > understanding is that we zero-fill the new blocks before populating
> > them with actual data precisely to avoid running out of disk space due
> > to deferred allocation at the OS level.  If we don't care about
> > failures due to deferred allocation at the OS level, we can rip that
> > logic out and improve the performance of relation extension
> > considerably.
>
> See my reply to Stephen.  The fact that this fails to guarantee no
> ENOSPC on COW filesystems doesn't mean that it's not worth doing on
> other filesystems.  We're reducing the risk, not eliminating it,
> but reducing risk is still a worthwhile activity.

Considering how much work we end up doing to extend a relation, and how
we know that's been a hotspot, I'm not entirely sure I agree that
avoiding the relatively infrequent out-of-disk-space concern when
extending the relation (instead of letting it happen when we go to
actually write data into the page) really is a good trade-off to make.

Thanks!

Stephen

Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Robert Haas
Date:
On Wed, Mar 15, 2017 at 11:02 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> That theory seems inconsistent with how mdextend() works.  My
>> understanding is that we zero-fill the new blocks before populating
>> them with actual data precisely to avoid running out of disk space due
>> to deferred allocation at the OS level.  If we don't care about
>> failures due to deferred allocation at the OS level, we can rip that
>> logic out and improve the performance of relation extension
>> considerably.
>
> See my reply to Stephen.  The fact that this fails to guarantee no
> ENOSPC on COW filesystems doesn't mean that it's not worth doing on
> other filesystems.  We're reducing the risk, not eliminating it,
> but reducing risk is still a worthwhile activity.

Well, then it would presumably be worth reducing for hash indexes, too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Write Ahead Logging for Hash Indexes

From
Amit Kapila
Date:
On Wed, Mar 15, 2017 at 7:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Mar 15, 2017 at 9:18 AM, Stephen Frost <sfrost@snowman.net> wrote:
>>> I think we have sufficient comments in the code, especially on top of
>>> the function _hash_alloc_buckets().
>>
>> I don't see any comments regarding how we have to be sure to handle
>> an out-of-space case properly in the middle of a file because we've made
>> it sparse.
>>
>> I do see that mdwrite() should handle an out-of-disk-space case, though
>> that just makes me wonder what's different here compared to normal
>> relations, such that we don't have an issue with a sparse WAL'd hash
>> index but we can't handle it if a normal relation is sparse.
>
> I agree.  I think that what hash indexes are doing here is
> inconsistent with what we do for btree indexes and the heap.  And I
> don't think it would be bad to fix that.  We could, of course, go the
> other way and do what Tom is suggesting - insist that everybody's got
> to be prepared for sparse files, but I would view that as something of
> a U-turn.  I think hash indexes are like this because nobody's really
> worried about hash indexes because they haven't been crash-safe or
> performant.  Now that we've made them crash safe and are on the way to
> making them performant, fixing other things is, as Tom already said, a
> lot more interesting.
>
> Now, that having been said, I'm not sure it's a good idea to tinker
> with the behavior for v10.  We could change the new-splitpoint code so
> that it loops over all the pages in the new splitpoint and zeroes them
> all, instead of just the last one.
>

Yeah, but I am slightly afraid that apart from consuming too much
space, it will make the operation a lot slower when we have to perform
the allocation step.  Another possibility is to do that when we
actually use/allocate the bucket.  As of now, _hash_getnewbuf assumes
that the block is pre-existing and avoids the actual read/extend by
using RBM_ZERO_AND_LOCK mode.  I think we can change it so that it
forces a new allocation (if the block doesn't exist) on demand.  I agree
this is somewhat more of a change than what you have proposed to
circumvent the sparse-file behavior, but it may not be a ton more work
if we decide to fix this, and it has the advantage of allocating the
space on disk only when it is actually needed.
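
The shape of that change might be roughly as follows (a sketch under the
assumptions above; the function name is hypothetical, and a real change
would need to think about relation-extension locking and concurrent
callers):

/*
 * Sketch: return the requested block zeroed and exclusively locked,
 * physically extending the relation on demand if the block was never
 * allocated (e.g. because the splitpoint was left as a hole).
 */
Buffer
_hash_getnewbuf_on_demand(Relation rel, BlockNumber blkno, ForkNumber forkNum)
{
    BlockNumber nblocks = RelationGetNumberOfBlocksInFork(rel, forkNum);
    Buffer      buf;

    if (blkno < nblocks)
    {
        /* Block already exists on disk; just zero and lock it. */
        return ReadBufferExtended(rel, forkNum, blkno, RBM_ZERO_AND_LOCK, NULL);
    }

    /* Extend the relation until we reach the requested block. */
    for (;;)
    {
        buf = ReadBufferExtended(rel, forkNum, P_NEW, RBM_NORMAL, NULL);
        if (BufferGetBlockNumber(buf) >= blkno)
            break;
        ReleaseBuffer(buf);     /* intermediate filler block */
    }

    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
    return buf;
}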

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com