Thread: Write Ahead Logging for Hash Indexes
$SUBJECT will make hash indexes reliable and usable on standby. AFAIU, currently hash indexes are not recommended for use in production mainly because they are not crash-safe. With this patch, I hope we can address that limitation and recommend them for production use.

This patch is built on my earlier patch [1] for making hash indexes concurrent. The main reason for doing so is that the earlier patch allows the split operation to be completed and uses light-weight locking, due to which operations can be logged at a granular level.

WAL for different operations:

This has been explained in the README as well, but I am writing it here again for convenience.
=================================
Multiple WAL records are written for the create index operation: first one for initializing the metapage, followed by one for each new bucket created during the operation, followed by one for initializing the bitmap page. If the system crashes after any of these, the whole operation is rolled back. I considered writing a single WAL record for the whole operation, but for that we would need to limit the number of initial buckets that can be created during the operation. As we can log only a fixed number of pages, XLR_MAX_BLOCK_ID (32), with the current XLog machinery, it is better to write multiple WAL records for this operation. The downside of restricting the number of buckets is that we would need to perform split operations if the number of tuples is more than what can be accommodated in the initial set of buckets, and it is not unusual to have a large number of tuples during a create index operation.

Ordinary item insertions (that don't force a page split or need a new overflow page) are single WAL entries. They touch a single bucket page and the metapage; the metapage is updated during replay just as it is updated during the original operation.

An insertion that causes the addition of an overflow page is logged as a single WAL entry, preceded by a WAL entry for the new overflow page required to insert the tuple. There is a corner case where, by the time we try to use the newly allocated overflow page, it has already been used by concurrent insertions; in that case, a new overflow page will be allocated and a separate WAL entry will be made for it.

An insertion that causes a bucket split is logged as a single WAL entry, followed by a WAL entry for allocating a new bucket, followed by a WAL entry for each overflow bucket page in the new bucket to which tuples are moved from the old bucket, followed by a WAL entry to indicate that the split is complete for both the old and new buckets. A split operation which requires overflow pages to complete will need to write a WAL record for each new allocation of an overflow page.

As splitting involves multiple atomic actions, it's possible that the system crashes between moving tuples from bucket pages of the old bucket to the new bucket. After recovery, both the old and new buckets will be marked with the incomplete-split flag. The reader algorithm works correctly, as it will scan both the old and new buckets. We finish the split at the next insert or split operation on the old bucket. It could be done during searches, too, but it seems best not to put any extra updates in what would otherwise be a read-only operation (updating is not possible in hot standby mode anyway). It would seem natural to complete the split in VACUUM, but since splitting a bucket might require allocating a new page, it might fail if you run out of disk space. That would be bad during VACUUM - the reason for running VACUUM in the first place might be that you have run out of disk space, and now VACUUM won't finish because you're out of disk space. In contrast, an insertion can require enlarging the physical file anyway.

Deletion of tuples from a bucket is performed for two reasons: to remove dead tuples and to remove tuples that were moved by a split. A WAL entry is made for each bucket page from which tuples are removed, followed by a WAL entry to clear the garbage flag if the tuples moved by the split were removed. Another separate WAL entry is made for updating the metapage if the deletion is performed to remove dead tuples by vacuum. As deletion involves multiple atomic operations, it is quite possible that the system crashes after (a) removing tuples from some of the bucket pages, (b) before clearing the garbage flag, or (c) before updating the metapage. If the system crashes before completing (b), it will again try to clean the bucket during the next vacuum or insert after recovery, which can have some performance impact, but it will work fine. If the system crashes before completing (c), after recovery there could be some additional splits until the next vacuum updates the metapage, but the other operations like insert, delete and scan will work correctly. We can fix this problem by actually updating the metapage based on the delete operation during replay, but I am not sure if it is worth the complication.

The squeeze operation moves tuples from one of the buckets later in the chain to one of the buckets earlier in the chain, and writes a WAL record when either the bucket to which it is writing tuples is filled or the bucket from which it is removing tuples becomes empty. As the squeeze operation involves multiple atomic operations, it is quite possible that the system crashes before completing the operation on the entire bucket. After recovery, the operations will work correctly, but the index will remain bloated and can impact the performance of read and insert operations until the next vacuum squeezes the bucket completely.
=====================================
One of the challenges in writing this patch was that the current code was not written with the mindset that we need to write WAL for different operations. A typical example is _hash_addovflpage(), where pages are modified across different function calls and all modifications need to be done atomically, so I had to refactor some code so that the operations can be logged sensibly.

Thanks to Ashutosh Sharma, who has helped me complete the patch by writing WAL for the create index and delete operations and who has done detailed testing of the patch using the pg_filedump tool. I think it is better if he himself explains the testing he has done to ensure the correctness of the patch.

Thoughts?

Note - To use this patch, first apply the latest version of the concurrent hash index patch [2].

[1] - https://commitfest.postgresql.org/10/647/
[2] - https://www.postgresql.org/message-id/CAA4eK1LkQ_Udism-Z2Dq6cUvjH3dB5FNFNnEzZBPsRjw0haFqA@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
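To make the single-WAL-entry insertion case above concrete, here is a minimal sketch of how such a record could be assembled with the generic XLog machinery. This is illustrative only - the record id XLOG_HASH_INSERT, the xl_hash_insert payload and the block numbers are assumptions for the example, not necessarily what the patch uses:

/*
 * Sketch only: log an ordinary hash-index insertion (no split, no new
 * overflow page) as a single WAL record.  The caller is assumed to have
 * already added the tuple to the bucket page and bumped the metapage
 * tuple count inside a critical section.
 */
#include "postgres.h"

#include "access/itup.h"
#include "access/xloginsert.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"

#define XLOG_HASH_INSERT	0x10	/* assumed record id for this sketch */

typedef struct xl_hash_insert
{
	OffsetNumber offnum;		/* offset at which the tuple was added */
} xl_hash_insert;

static XLogRecPtr
log_hash_insert(Buffer bucketbuf, Buffer metabuf,
				IndexTuple itup, OffsetNumber offnum)
{
	xl_hash_insert xlrec;
	XLogRecPtr	recptr;

	xlrec.offnum = offnum;

	XLogBeginInsert();
	XLogRegisterData((char *) &xlrec, sizeof(xl_hash_insert));

	/*
	 * Block 0: the bucket (or overflow) page that received the tuple.
	 * Attach the tuple itself so replay can re-insert it when no
	 * full-page image is restored.
	 */
	XLogRegisterBuffer(0, bucketbuf, REGBUF_STANDARD);
	XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));

	/* Block 1: the metapage, whose tuple count is updated. */
	XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);

	recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INSERT);

	PageSetLSN(BufferGetPage(bucketbuf), recptr);
	PageSetLSN(BufferGetPage(metabuf), recptr);

	return recptr;
}

During replay, each registered block is redone only if its page LSN shows the change is missing (or it is restored wholesale from a full-page image), which is how the metapage stays in sync on the standby exactly as described above.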
Attachment
On Tue, Aug 23, 2016 at 8:54 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > $SUBJECT will make hash indexes reliable and usable on standby. > AFAIU, currently hash indexes are not recommended to be used in > production mainly because they are not crash-safe and with this patch, > I hope we can address that limitation and recommend them for use in > production. > > This patch is built on my earlier patch [1] of making hash indexes > concurrent. The main reason for doing so is that the earlier patch > allows to complete the split operation and used light-weight locking > due to which operations can be logged at granular level. > > WAL for different operations: > > This has been explained in README as well, but I am again writing it > here for the ease of people. > > .. > One of the challenge in writing this patch was that the current code > was not written with a mindset that we need to write WAL for different > operations. Typical example is _hash_addovflpage() where pages are > modified across different function calls and all modifications needs > to be done atomically, so I have to refactor some code so that the > operations can be logged sensibly. > This patch has not done handling for OldSnapshot. Previously, we haven't done TestForOldSnapshot() checks in hash index as they were not logged, but now with this patch, it makes sense to perform such checks. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Hi All,

Following are the steps that I have followed to verify the WAL logging of hash indexes:

1. I used Mithun's patch to improve coverage of hash index code [1] to verify the WAL logging of hash indexes. Firstly, I confirmed whether all the XLOG records associated with hash indexes are being covered or not using this patch. In case any XLOG record for a hash index operation was not being covered, I added a test case for it. I found that one of the XLOG records, 'XLOG_HASH_MOVE_PAGE_CONTENTS', was not being covered and added a small test case for it. The patch for this is available at [2].

2. I executed the regression test suite and found all the hash indexes that are getting created as part of the regression test suite using the below query.

SELECT t.relname index_name, t.oid FROM pg_class t JOIN pg_am idx ON idx.oid = t.relam WHERE idx.amname = 'hash';

3. Thirdly, I calculated the number of pages associated with each hash index and compared every page of the hash index on the master and standby servers using the pg_filedump tool. For example, if the number of pages associated with 'con_hash_index' is 10, then here is what I did:

On master:
-----------------
select pg_relation_filepath('con_hash_index');
 pg_relation_filepath
----------------------
 base/16408/16433
(1 row)

./pg_filedump -if -R 0 9 /home/edb/git-clone-postgresql/postgresql/TMP/postgres/master/base/16408/16433 > /tmp/file1

On Slave:
---------------
select pg_relation_filepath('con_hash_index');
 pg_relation_filepath
----------------------
 base/16408/16433
(1 row)

./pg_filedump -if -R 0 9 /home/edb/git-clone-postgresql/postgresql/TMP/postgres/standby/base/16408/16433 > /tmp/file2

Then I compared file1 and file2 using a diff tool.

Following is the list of hash indexes that got created inside the regression database when the regression test suite was executed on the master server:

hash_i4_index
hash_name_index
hash_txt_index
hash_f8_index
con_hash_index
hash_idx

In short, this is all I did, and I found no issues during testing. Please let me know if you need any further details. I would like to thank Amit for his support and guidance during the testing phase.

[1] - https://www.postgresql.org/message-id/CAA4eK1JOBX%3DYU33631Qh-XivYXtPSALh514%2BjR8XeD7v%2BK3r_Q%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAE9k0PkNjryhSiG53mjnKFhi%2BMipJMjSa%3DYkH-UeW3bfr1HPJQ%40mail.gmail.com

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
Hi Amit,
Thanks for working on this.
When building with --enable-cassert, I get a compiler warning:
hash.c: In function 'hashbucketcleanup':
hash.c:722: warning: 'new_bucket' may be used uninitialized in this function
After an intentionally created crash, I get an Assert triggering:
TRAP: FailedAssertion("!(((freep)[(bitmapbit)/32] & (1<<((bitmapbit)%32))))", File: "hashovfl.c", Line: 553)
freep[0] is zero and bitmapbit is 16.
With this backtrace:
(gdb) bt
#0 0x0000003838c325e5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x0000003838c33dc5 in abort () at abort.c:92
#2 0x000000000081a8fd in ExceptionalCondition (conditionName=<value optimized out>, errorType=<value optimized out>, fileName=<value optimized out>,
lineNumber=<value optimized out>) at assert.c:54
#3 0x00000000004a4199 in _hash_freeovflpage (rel=0x7f3f745d86b8, bucketbuf=198, ovflbuf=199, wbuf=198, itups=0x7ffc258fa090, itup_offsets=0x126e8a8,
tups_size=0x7ffc258f93d0, nitups=70, bstrategy=0x12ba320) at hashovfl.c:553
#4 0x00000000004a4c32 in _hash_squeezebucket (rel=<value optimized out>, bucket=38, bucket_blkno=56, bucket_buf=198, bstrategy=0x12ba320)
at hashovfl.c:1010
#5 0x00000000004a042a in hashbucketcleanup (rel=0x7f3f745d86b8, bucket_buf=198, bucket_blkno=56, bstrategy=0x12ba320, maxbucket=96, highmask=127,
lowmask=63, tuples_removed=0x7ffc258fc1c8, num_index_tuples=0x7ffc258fc1c0, bucket_has_garbage=0 '\000', delay=1 '\001',
callback=0x5e9bd0 <lazy_tid_reaped>, callback_state=0x126e248) at hash.c:937
#6 0x00000000004a07e7 in hashbulkdelete (info=0x7ffc258fc2b0, stats=0x0, callback=0x5e9bd0 <lazy_tid_reaped>, callback_state=0x126e248) at hash.c:580
#7 0x00000000005e98c5 in lazy_vacuum_index (indrel=0x7f3f745d86b8, stats=0x126ecc0, vacrelstats=0x126e248) at vacuumlazy.c:1599
#8 0x00000000005ea7f9 in lazy_scan_heap (onerel=<value optimized out>, options=<value optimized out>, params=0x12ba290, bstrategy=<value optimized out>)
at vacuumlazy.c:1291
#9 lazy_vacuum_rel (onerel=<value optimized out>, options=<value optimized out>, params=0x12ba290, bstrategy=<value optimized out>) at vacuumlazy.c:255
#10 0x00000000005e8939 in vacuum_rel (relid=17329, relation=0x7ffc258fcbd0, options=99, params=0x12ba290) at vacuum.c:1399
#11 0x00000000005e8d01 in vacuum (options=99, relation=0x7ffc258fcbd0, relid=<value optimized out>, params=0x12ba290, va_cols=0x0,
bstrategy=<value optimized out>, isTopLevel=1 '\001') at vacuum.c:307
#12 0x00000000006a07f1 in autovacuum_do_vac_analyze () at autovacuum.c:2823
#13 do_autovacuum () at autovacuum.c:2341
#14 0x00000000006a0f9c in AutoVacWorkerMain (argc=<value optimized out>, argv=<value optimized out>) at autovacuum.c:1656
#15 0x00000000006a1116 in StartAutoVacWorker () at autovacuum.c:1461
#16 0x00000000006afb00 in StartAutovacuumWorker (postgres_signal_arg=<value optimized out>) at postmaster.c:5323
#17 sigusr1_handler (postgres_signal_arg=<value optimized out>) at postmaster.c:5009
#18 <signal handler called>
#19 0x0000003838ce1503 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:82
#20 0x00000000006b0ec0 in ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1657
#21 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1301
#22 0x0000000000632e88 in main (argc=4, argv=0x11f4d50) at main.c:228
Cheers,
Jeff
On 23/08/16 15:24, Amit Kapila wrote: > > Thoughts? > > Note - To use this patch, first apply latest version of concurrent > hash index patch [2]. > > [1] - https://commitfest.postgresql.org/10/647/ > [2] - https://www.postgresql.org/message-id/CAA4eK1LkQ_Udism-Z2Dq6cUvjH3dB5FNFNnEzZBPsRjw0haFqA@mail.gmail.com > > Firstly - really nice! Patches applied easily etc to latest version 10 checkout. I thought I'd test by initializing pgbench schema, adding a standby, then adding some hash indexes and running pgbench: Looking on the standby: bench=# \d pgbench_accounts Table "public.pgbench_accounts" Column | Type | Modifiers ----------+---------------+----------- aid | integer | not null bid | integer | abalance | integer | filler | character(84) | Indexes: "pgbench_accounts_pkey" PRIMARY KEY, btree (aid) "pgbench_accounts_bid" hash (bid) <==== bench=# \d pgbench_history Table "public.pgbench_history" Column | Type | Modifiers --------+-----------------------------+----------- tid | integer | bid | integer | aid | integer | delta | integer | mtime | timestamp without time zone| filler | character(22) | Indexes: "pgbench_history_bid" hash (bid) <===== they have been replicated there ok. However I'm seeing a hang on the master after a while: bench=# SELECT datname,application_name,state,now()-xact_start AS wait,query FROM pg_stat_activity ORDER BY wait DESC; datname | application_name | state | wait | query ---------+------------------+--------+-----------------+---------------------------------------------------------------------------------------------------------------- | walreceiver | idle | | bench | pgbench | active | 00:31:38.367467 | INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (921, 38, 251973, -3868, CURRENT_TIMESTAMP); bench | pgbench | active | 00:31:38.215378 | INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (280, 95, 3954814, 2091, CURRENT_TIMESTAMP); bench | pgbench | active | 00:31:35.991056 | INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (447, 33, 8355237, 3438, CURRENT_TIMESTAMP); bench | pgbench | active | 00:31:35.619798 | INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (134, 16, 4839994, -2443, CURRENT_TIMESTAMP); bench | pgbench | active | 00:31:35.544196 | INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (37, 73, 9620119, 4053, CURRENT_TIMESTAMP); bench | pgbench | active | 00:31:35.334504 | UPDATE pgbench_branches SET bbalance = bbalance + -2954 WHERE bid = 33; bench | pgbench | active | 00:31:35.234112 |UPDATE pgbench_branches SET bbalance = bbalance + -713 WHERE bid = 38; bench | pgbench | active | 00:31:34.434676 | UPDATE pgbench_branches SET bbalance = bbalance + -921 WHERE bid = 33; bench | psql | active | 00:00:00 | SELECT datname,application_name,state,now()-xact_start AS wait,query FROM pg_stat_activity ORDER BY wait DESC; (10 rows) but no errors in the logs, any thoughts? Cheers Mark
On 24/08/16 12:09, Mark Kirkwood wrote: > On 23/08/16 15:24, Amit Kapila wrote: >> >> Thoughts? >> >> Note - To use this patch, first apply latest version of concurrent >> hash index patch [2]. >> >> [1] - https://commitfest.postgresql.org/10/647/ >> [2] - >> https://www.postgresql.org/message-id/CAA4eK1LkQ_Udism-Z2Dq6cUvjH3dB5FNFNnEzZBPsRjw0haFqA@mail.gmail.com >> >> > > Firstly - really nice! Patches applied easily etc to latest version 10 > checkout. > > I thought I'd test by initializing pgbench schema, adding a standby, > then adding some hash indexes and running pgbench: > > Looking on the standby: > > bench=# \d pgbench_accounts > Table "public.pgbench_accounts" > Column | Type | Modifiers > ----------+---------------+----------- > aid | integer | not null > bid | integer | > abalance | integer | > filler | character(84) | > Indexes: > "pgbench_accounts_pkey" PRIMARY KEY, btree (aid) > "pgbench_accounts_bid" hash (bid) <==== > > bench=# \d pgbench_history > Table "public.pgbench_history" > Column | Type | Modifiers > --------+-----------------------------+----------- > tid | integer | > bid | integer | > aid | integer | > delta | integer | > mtime | timestamp without time zone | > filler | character(22) | > Indexes: > "pgbench_history_bid" hash (bid) <===== > > > they have been replicated there ok. > > However I'm seeing a hang on the master after a while: > > bench=# SELECT datname,application_name,state,now()-xact_start AS > wait,query FROM pg_stat_activity ORDER BY wait DESC; > datname | application_name | state | wait | query > ---------+------------------+--------+-----------------+---------------------------------------------------------------------------------------------------------------- > > | walreceiver | idle | | > bench | pgbench | active | 00:31:38.367467 | INSERT INTO > pgbench_history (tid, bid, aid, delta, mtime) VALUES (921, 38, 251973, > -3868, CURRENT_TIMESTAMP); > bench | pgbench | active | 00:31:38.215378 | INSERT INTO > pgbench_history (tid, bid, aid, delta, mtime) VALUES (280, 95, > 3954814, 2091, CURRENT_TIMESTAMP); > bench | pgbench | active | 00:31:35.991056 | INSERT INTO > pgbench_history (tid, bid, aid, delta, mtime) VALUES (447, 33, > 8355237, 3438, CURRENT_TIMESTAMP); > bench | pgbench | active | 00:31:35.619798 | INSERT INTO > pgbench_history (tid, bid, aid, delta, mtime) VALUES (134, 16, > 4839994, -2443, CURRENT_TIMESTAMP); > bench | pgbench | active | 00:31:35.544196 | INSERT INTO > pgbench_history (tid, bid, aid, delta, mtime) VALUES (37, 73, 9620119, > 4053, CURRENT_TIMESTAMP); > bench | pgbench | active | 00:31:35.334504 | UPDATE > pgbench_branches SET bbalance = bbalance + -2954 WHERE bid = 33; > bench | pgbench | active | 00:31:35.234112 | UPDATE > pgbench_branches SET bbalance = bbalance + -713 WHERE bid = 38; > bench | pgbench | active | 00:31:34.434676 | UPDATE > pgbench_branches SET bbalance = bbalance + -921 WHERE bid = 33; > bench | psql | active | 00:00:00 | SELECT > datname,application_name,state,now()-xact_start AS wait,query FROM > pg_stat_activity ORDER BY wait DESC; > (10 rows) > > but no errors in the logs, any thoughts? FWIW, retesting with the wal logging patch removed (i.e leaving the concurrent hast one) works fine.
On Wed, Aug 24, 2016 at 8:54 AM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote: > On 24/08/16 12:09, Mark Kirkwood wrote: >> >> On 23/08/16 15:24, Amit Kapila wrote: >>> >>> >>> Thoughts? >>> >>> Note - To use this patch, first apply latest version of concurrent >>> hash index patch [2]. >>> >>> [1] - https://commitfest.postgresql.org/10/647/ >>> [2] - >>> https://www.postgresql.org/message-id/CAA4eK1LkQ_Udism-Z2Dq6cUvjH3dB5FNFNnEzZBPsRjw0haFqA@mail.gmail.com >>> >>> >> >> Firstly - really nice! Patches applied easily etc to latest version 10 >> checkout. >> >> I thought I'd test by initializing pgbench schema, adding a standby, then >> adding some hash indexes and running pgbench: >> >> Looking on the standby: >> >> bench=# \d pgbench_accounts >> Table "public.pgbench_accounts" >> Column | Type | Modifiers >> ----------+---------------+----------- >> aid | integer | not null >> bid | integer | >> abalance | integer | >> filler | character(84) | >> Indexes: >> "pgbench_accounts_pkey" PRIMARY KEY, btree (aid) >> "pgbench_accounts_bid" hash (bid) <==== >> >> bench=# \d pgbench_history >> Table "public.pgbench_history" >> Column | Type | Modifiers >> --------+-----------------------------+----------- >> tid | integer | >> bid | integer | >> aid | integer | >> delta | integer | >> mtime | timestamp without time zone | >> filler | character(22) | >> Indexes: >> "pgbench_history_bid" hash (bid) <===== >> >> >> they have been replicated there ok. >> >> However I'm seeing a hang on the master after a while: >> >> bench=# SELECT datname,application_name,state,now()-xact_start AS >> wait,query FROM pg_stat_activity ORDER BY wait DESC; >> datname | application_name | state | wait | query >> >> ---------+------------------+--------+-----------------+---------------------------------------------------------------------------------------------------------------- >> | walreceiver | idle | | >> bench | pgbench | active | 00:31:38.367467 | INSERT INTO >> pgbench_history (tid, bid, aid, delta, mtime) VALUES (921, 38, 251973, >> -3868, CURRENT_TIMESTAMP); >> bench | pgbench | active | 00:31:38.215378 | INSERT INTO >> pgbench_history (tid, bid, aid, delta, mtime) VALUES (280, 95, 3954814, >> 2091, CURRENT_TIMESTAMP); >> bench | pgbench | active | 00:31:35.991056 | INSERT INTO >> pgbench_history (tid, bid, aid, delta, mtime) VALUES (447, 33, 8355237, >> 3438, CURRENT_TIMESTAMP); >> bench | pgbench | active | 00:31:35.619798 | INSERT INTO >> pgbench_history (tid, bid, aid, delta, mtime) VALUES (134, 16, 4839994, >> -2443, CURRENT_TIMESTAMP); >> bench | pgbench | active | 00:31:35.544196 | INSERT INTO >> pgbench_history (tid, bid, aid, delta, mtime) VALUES (37, 73, 9620119, 4053, >> CURRENT_TIMESTAMP); >> bench | pgbench | active | 00:31:35.334504 | UPDATE >> pgbench_branches SET bbalance = bbalance + -2954 WHERE bid = 33; >> bench | pgbench | active | 00:31:35.234112 | UPDATE >> pgbench_branches SET bbalance = bbalance + -713 WHERE bid = 38; >> bench | pgbench | active | 00:31:34.434676 | UPDATE >> pgbench_branches SET bbalance = bbalance + -921 WHERE bid = 33; >> bench | psql | active | 00:00:00 | SELECT >> datname,application_name,state,now()-xact_start AS wait,query FROM >> pg_stat_activity ORDER BY wait DESC; >> (10 rows) >> >> but no errors in the logs, any thoughts? > Can you get the call stacks? > > > FWIW, retesting with the wal logging patch removed (i.e leaving the > concurrent hast one) works fine. > Okay, information noted. Thanks for testing and showing interest in the patch. 
-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 24/08/16 15:36, Amit Kapila wrote: > On Wed, Aug 24, 2016 at 8:54 AM, Mark Kirkwood > <mark.kirkwood@catalyst.net.nz> wrote: >> On 24/08/16 12:09, Mark Kirkwood wrote: >>> On 23/08/16 15:24, Amit Kapila wrote: >>>> >>>> Thoughts? >>>> >>>> Note - To use this patch, first apply latest version of concurrent >>>> hash index patch [2]. >>>> >>>> [1] - https://commitfest.postgresql.org/10/647/ >>>> [2] - >>>> https://www.postgresql.org/message-id/CAA4eK1LkQ_Udism-Z2Dq6cUvjH3dB5FNFNnEzZBPsRjw0haFqA@mail.gmail.com >>>> >>>> >>> Firstly - really nice! Patches applied easily etc to latest version 10 >>> checkout. >>> >>> I thought I'd test by initializing pgbench schema, adding a standby, then >>> adding some hash indexes and running pgbench: >>> >>> Looking on the standby: >>> >>> bench=# \d pgbench_accounts >>> Table "public.pgbench_accounts" >>> Column | Type | Modifiers >>> ----------+---------------+----------- >>> aid | integer | not null >>> bid | integer | >>> abalance | integer | >>> filler | character(84) | >>> Indexes: >>> "pgbench_accounts_pkey" PRIMARY KEY, btree (aid) >>> "pgbench_accounts_bid" hash (bid) <==== >>> >>> bench=# \d pgbench_history >>> Table "public.pgbench_history" >>> Column | Type | Modifiers >>> --------+-----------------------------+----------- >>> tid | integer | >>> bid | integer | >>> aid | integer | >>> delta | integer | >>> mtime | timestamp without time zone | >>> filler | character(22) | >>> Indexes: >>> "pgbench_history_bid" hash (bid) <===== >>> >>> >>> they have been replicated there ok. >>> >>> However I'm seeing a hang on the master after a while: >>> >>> bench=# SELECT datname,application_name,state,now()-xact_start AS >>> wait,query FROM pg_stat_activity ORDER BY wait DESC; >>> datname | application_name | state | wait | query >>> >>> ---------+------------------+--------+-----------------+---------------------------------------------------------------------------------------------------------------- >>> | walreceiver | idle | | >>> bench | pgbench | active | 00:31:38.367467 | INSERT INTO >>> pgbench_history (tid, bid, aid, delta, mtime) VALUES (921, 38, 251973, >>> -3868, CURRENT_TIMESTAMP); >>> bench | pgbench | active | 00:31:38.215378 | INSERT INTO >>> pgbench_history (tid, bid, aid, delta, mtime) VALUES (280, 95, 3954814, >>> 2091, CURRENT_TIMESTAMP); >>> bench | pgbench | active | 00:31:35.991056 | INSERT INTO >>> pgbench_history (tid, bid, aid, delta, mtime) VALUES (447, 33, 8355237, >>> 3438, CURRENT_TIMESTAMP); >>> bench | pgbench | active | 00:31:35.619798 | INSERT INTO >>> pgbench_history (tid, bid, aid, delta, mtime) VALUES (134, 16, 4839994, >>> -2443, CURRENT_TIMESTAMP); >>> bench | pgbench | active | 00:31:35.544196 | INSERT INTO >>> pgbench_history (tid, bid, aid, delta, mtime) VALUES (37, 73, 9620119, 4053, >>> CURRENT_TIMESTAMP); >>> bench | pgbench | active | 00:31:35.334504 | UPDATE >>> pgbench_branches SET bbalance = bbalance + -2954 WHERE bid = 33; >>> bench | pgbench | active | 00:31:35.234112 | UPDATE >>> pgbench_branches SET bbalance = bbalance + -713 WHERE bid = 38; >>> bench | pgbench | active | 00:31:34.434676 | UPDATE >>> pgbench_branches SET bbalance = bbalance + -921 WHERE bid = 33; >>> bench | psql | active | 00:00:00 | SELECT >>> datname,application_name,state,now()-xact_start AS wait,query FROM >>> pg_stat_activity ORDER BY wait DESC; >>> (10 rows) >>> >>> but no errors in the logs, any thoughts? > Can you get the call stacks? > For every stuck backend? (just double checking)...
On Wed, Aug 24, 2016 at 9:53 AM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote: > On 24/08/16 15:36, Amit Kapila wrote: >> >> On Wed, Aug 24, 2016 at 8:54 AM, Mark Kirkwood >> <mark.kirkwood@catalyst.net.nz> wrote: >>> >> >> Can you get the call stacks? >> > > For every stuck backend? (just double checking)... Yeah. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 24/08/16 16:33, Amit Kapila wrote: > On Wed, Aug 24, 2016 at 9:53 AM, Mark Kirkwood > <mark.kirkwood@catalyst.net.nz> wrote: >> On 24/08/16 15:36, Amit Kapila wrote: >>> On Wed, Aug 24, 2016 at 8:54 AM, Mark Kirkwood >>> <mark.kirkwood@catalyst.net.nz> wrote: >>> Can you get the call stacks? >>> >> For every stuck backend? (just double checking)... > Yeah. > Cool, I managed to reproduce with a reduced workload of 4 backends, then noticed that the traces for 3 of 'em were all the same. So I've attached the 2 unique ones, plus noted what pg_stat_activity thought the wait event was, in case that is useful. Cheers Mark
Attachment
On 24/08/16 16:52, Mark Kirkwood wrote: > On 24/08/16 16:33, Amit Kapila wrote: >> On Wed, Aug 24, 2016 at 9:53 AM, Mark Kirkwood >> <mark.kirkwood@catalyst.net.nz> wrote: >>> On 24/08/16 15:36, Amit Kapila wrote: >>>> On Wed, Aug 24, 2016 at 8:54 AM, Mark Kirkwood >>>> <mark.kirkwood@catalyst.net.nz> wrote: >>>> Can you get the call stacks? >>>> >>> For every stuck backend? (just double checking)... >> Yeah. >> > > Cool, > > I managed to reproduce with a reduced workload of 4 backends, then > noticed that the traces for 3 of 'em were all the same. So I've > attached the 2 unique ones, plus noted what pg_stat_activity thought > the wait event was, in case that is useful. ...actually I was wrong there, only 2 of them were the same. So I've attached a new log of bt's of them all.
Attachment
On Wed, Aug 24, 2016 at 2:37 AM, Jeff Janes <jeff.janes@gmail.com> wrote: > Hi Amit, > > Thanks for working on this. > > When building with --enable-cassert, I get compiler warning: > > hash.c: In function 'hashbucketcleanup': > hash.c:722: warning: 'new_bucket' may be used uninitialized in this function > This warning is from concurrent index patch. I will fix it and post the patch on that thread. > > After an intentionally created crash, I get an Assert triggering: > > TRAP: FailedAssertion("!(((freep)[(bitmapbit)/32] & > (1<<((bitmapbit)%32))))", File: "hashovfl.c", Line: 553) > > freep[0] is zero and bitmapbit is 16. > Here what is happening is that when it tries to clear the bitmapbit, it expects it to be set. Now, I think the reason for why it didn't find the bit as set could be that after the new overflow page is added and the bit corresponding to it is set, you might have crashed the system and the replay would not have set the bit. Then while freeing the overflow page it can hit the Assert as mentioned by you. I think the problem here could be that I am using REGBUF_STANDARD to log the bitmap page updates which seems to be causing the issue. As bitmap page doesn't follow the standard page layout, it would have omitted the actual contents while taking full page image and then during replay, it would not have set the bit, because page doesn't need REDO. I think here the fix is to use REGBUF_NO_IMAGE as we use for vm buffers. If you can send me the detailed steps for how you have produced the problem, then I can verify after fixing whether you are seeing the same problem or something else. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
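To spell the fix out, the change is in how the bitmap page is registered when building the WAL record - roughly the following (a sketch with an assumed block number and variable name, not the patch's actual code):

/*
 * Hash bitmap pages don't follow the standard page layout, so a
 * full-page image taken with REGBUF_STANDARD can drop the bitmap words
 * (they sit in what looks like the pd_lower/pd_upper "hole"), and
 * replay then restores an image in which the bit was never set.
 */
XLogRegisterBuffer(2, bitmapbuf, REGBUF_STANDARD);	/* current, problematic */

/*
 * Proposed fix: never take a full-page image of the bitmap page and
 * always replay the bit update itself, as is done for vm buffers.
 */
XLogRegisterBuffer(2, bitmapbuf, REGBUF_NO_IMAGE);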
On 24/08/16 17:01, Mark Kirkwood wrote: > > ...actually I was wrong there, only 2 of them were the same. So I've > attached a new log of bt's of them all. > > > And I can reproduce with only 1 session, figured that might be a helpful piece of the puzzle (trace attached).
Attachment
On Wed, Aug 24, 2016 at 11:44 AM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote: > On 24/08/16 17:01, Mark Kirkwood wrote: >> >> >> ...actually I was wrong there, only 2 of them were the same. So I've >> attached a new log of bt's of them all. >> >> >> > > And I can reproduce with only 1 session, figured that might be a helpful > piece of the puzzle (trace attached). > Thanks. I think I know the problem here. Basically, _hash_freeovflpage() is trying to take a lock on the buffer previous to the overflow page to update the links, and it is quite possible that the same buffer is already locked for moving the tuples while squeezing the bucket. I am working on a fix for this. Coincidentally, Ashutosh Sharma, a colleague of mine who was also testing this patch, found the same issue with the attached SQL script. So we might be able to include a test case in the regression suite as well after the fix. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Tue, Aug 23, 2016 at 10:05 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Aug 24, 2016 at 2:37 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>
>> After an intentionally created crash, I get an Assert triggering:
>>
>> TRAP: FailedAssertion("!(((freep)[(bitmapbit)/32] &
>> (1<<((bitmapbit)%32))))", File: "hashovfl.c", Line: 553)
>>
>> freep[0] is zero and bitmapbit is 16.
>>
> Here what is happening is that when it tries to clear the bitmapbit,
> it expects it to be set. Now, I think the reason for why it didn't
> find the bit as set could be that after the new overflow page is added
> and the bit corresponding to it is set, you might have crashed the
> system and the replay would not have set the bit. Then while freeing
> the overflow page it can hit the Assert as mentioned by you. I think
> the problem here could be that I am using REGBUF_STANDARD to log the
> bitmap page updates which seems to be causing the issue. As bitmap
> page doesn't follow the standard page layout, it would have omitted
> the actual contents while taking full page image and then during
> replay, it would not have set the bit, because page doesn't need REDO.
> I think here the fix is to use REGBUF_NO_IMAGE as we use for vm
> buffers.
>
> If you can send me the detailed steps for how you have produced the
> problem, then I can verify after fixing whether you are seeing the
> same problem or something else.
The test is rather awkward; it might be easier to just have me test it. But I've attached it.
There is a patch that needs to be applied and compiled (alongside your patches, of course) to inject the crashes, a perl script which creates the schema and does the updates, and a shell script which sets up the cluster with the appropriate parameters and then calls the perl script in a loop.
The top of the shell script has some hard-coded paths to the binaries and to the test data directory (which is automatically deleted).
I run it like "sh do.sh >& do.err &"
It gives two different types of assertion failures:
$ fgrep TRAP: do.err |sort|uniq -c
21 TRAP: FailedAssertion("!(((freep)[(bitmapbit)/32] & (1<<((bitmapbit)%32))))", File: "hashovfl.c", Line: 553)
32 TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c", Line: 2506)
The second one is related to the intentional crashes, and so is not relevant to you.
Cheers,
Jeff
Attachment
Amit Kapila wrote: > $SUBJECT will make hash indexes reliable and usable on standby. Nice work. Can you split the new xlog-related stuff to a new file, say hash_xlog.h, instead of cramming it in hash.h? Removing the existing #include "xlogreader.h" from hash.h would be nice. I volunteer for pushing any preliminary header cleanup commits. I think it's silly that access/hash.h is the go-to header for hash usage, including hash_any(). Not this patch's fault, of course, just venting. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Aug 24, 2016 at 11:46 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Amit Kapila wrote: >> $SUBJECT will make hash indexes reliable and usable on standby. > > Nice work. > > Can you split the new xlog-related stuff to a new file, say hash_xlog.h, > instead of cramming it in hash.h? Removing the existing #include > "xlogreader.h" from hash.h would be nice. I volunteer for pushing any > preliminary header cleanup commits. > So, what you are expecting here is that we create hash_xlog.h and move necessary functions like hash_redo(), hash_desc() and hash_identify() to it. Then do whatever else is required to build it successfully. Once that is done, I can build my patches on top of it. Is that right? If yes, I can work on it. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
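For reference, a rough sketch of what such a hash_xlog.h might contain (just the rmgr entry points that currently live in hash.h; this is an assumption about the final shape, not the actual file):

/* src/include/access/hash_xlog.h -- sketch only */
#ifndef HASH_XLOG_H
#define HASH_XLOG_H

#include "access/xlogreader.h"
#include "lib/stringinfo.h"

/* redo/desc/identify entry points for the hash resource manager */
extern void hash_redo(XLogReaderState *record);
extern void hash_desc(StringInfo buf, XLogReaderState *record);
extern const char *hash_identify(uint8 info);

#endif   /* HASH_XLOG_H */

That would also let the #include "xlogreader.h" move out of hash.h, as suggested.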
Amit Kapila wrote: > On Wed, Aug 24, 2016 at 11:46 PM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: > > Can you split the new xlog-related stuff to a new file, say hash_xlog.h, > > instead of cramming it in hash.h? Removing the existing #include > > "xlogreader.h" from hash.h would be nice. I volunteer for pushing any > > preliminary header cleanup commits. > > So, what you are expecting here is that we create hash_xlog.h and move > necessary functions like hash_redo(), hash_desc() and hash_identify() > to it. Then do whatever else is required to build it successfully. > Once that is done, I can build my patches on top of it. Is that > right? If yes, I can work on it. Yes, thanks. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Aug 25, 2016 at 6:54 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Amit Kapila wrote: >> On Wed, Aug 24, 2016 at 11:46 PM, Alvaro Herrera >> <alvherre@2ndquadrant.com> wrote: > >> > Can you split the new xlog-related stuff to a new file, say hash_xlog.h, >> > instead of cramming it in hash.h? Removing the existing #include >> > "xlogreader.h" from hash.h would be nice. I volunteer for pushing any >> > preliminary header cleanup commits. >> >> So, what you are expecting here is that we create hash_xlog.h and move >> necessary functions like hash_redo(), hash_desc() and hash_identify() >> to it. Then do whatever else is required to build it successfully. >> Once that is done, I can build my patches on top of it. Is that >> right? If yes, I can work on it. > > Yes, thanks. > How about attached? If you want, I think we can one step further and move hash_redo to a new file hash_xlog.c which is required for main patch, but we can leave it for later as well. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
Amit Kapila wrote: > How about attached? That works; pushed. (I removed a few #includes from the new file.) > If you want, I think we can one step further and move hash_redo to a > new file hash_xlog.c which is required for main patch, but we can > leave it for later as well. I think that can be a part of the main patch. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Aug 30, 2016 at 3:40 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Amit Kapila wrote: > >> How about attached? > > That works; pushed. Thanks. > (I removed a few #includes from the new file.) > oops, copied from hash.h and forgot to remove those. >> If you want, I think we can one step further and move hash_redo to a >> new file hash_xlog.c which is required for main patch, but we can >> leave it for later as well. > > I think that can be a part of the main patch. > Okay, makes sense. Any suggestions on splitting the main patch, or in general on the design/implementation, are welcome. I will rebase the patches on the latest commit. The current status is that I have fixed the problems reported by Jeff Janes and Mark Kirkwood. Currently, we are doing stress testing of the patch using pgbench; once that is done, I will post a new version. I am hoping that it will be complete this week. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Aug 24, 2016 at 10:32 PM, Jeff Janes <jeff.janes@gmail.com> wrote: > On Tue, Aug 23, 2016 at 10:05 PM, Amit Kapila <amit.kapila16@gmail.com> > wrote: >> >> On Wed, Aug 24, 2016 at 2:37 AM, Jeff Janes <jeff.janes@gmail.com> wrote: >> >> > >> > After an intentionally created crash, I get an Assert triggering: >> > >> > TRAP: FailedAssertion("!(((freep)[(bitmapbit)/32] & >> > (1<<((bitmapbit)%32))))", File: "hashovfl.c", Line: 553) >> > >> > freep[0] is zero and bitmapbit is 16. >> > >> >> Here what is happening is that when it tries to clear the bitmapbit, >> it expects it to be set. Now, I think the reason for why it didn't >> find the bit as set could be that after the new overflow page is added >> and the bit corresponding to it is set, you might have crashed the >> system and the replay would not have set the bit. Then while freeing >> the overflow page it can hit the Assert as mentioned by you. I think >> the problem here could be that I am using REGBUF_STANDARD to log the >> bitmap page updates which seems to be causing the issue. As bitmap >> page doesn't follow the standard page layout, it would have omitted >> the actual contents while taking full page image and then during >> replay, it would not have set the bit, because page doesn't need REDO. >> I think here the fix is to use REGBUF_NO_IMAGE as we use for vm >> buffers. >> >> If you can send me the detailed steps for how you have produced the >> problem, then I can verify after fixing whether you are seeing the >> same problem or something else. > > > > The test is rather awkward, it might be easier to just have me test it. > Okay, I have fixed this issue as explained above. Apart from that, I have fixed another issue reported by Mark Kirkwood upthread and few other issues found during internal testing by Ashutosh Sharma. The locking issue reported by Mark and Ashutosh is that the patch didn't maintain the locking order while adding overflow page as it maintains in other write operations (lock the bucket pages first and then metapage to perform the write operation). I have added the comments in _hash_addovflpage() to explain the locking order used in modified patch. During stress testing with pgbench using master-standby setup, we found an issue which indicates that WAL replay machinery doesn't expect completely zeroed pages (See explanation of RBM_NORMAL mode atop XLogReadBufferExtended). Previously before freeing the overflow page we were zeroing it, now I have changed it to just initialize the page such that the page will be empty. Apart from above, I have added support for old snapshot threshold in the hash index code. Thanks to Ashutosh Sharma for doing the testing of the patch and helping me in analyzing some of the above issues. I forgot to mention in my initial mail that Robert and I had some off-list discussions about the design of this patch, many thanks to him for providing inputs. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
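To make the overflow-page change concrete, it amounts to something like the following (a sketch with assumed variable names, not the patch's exact code):

/*
 * Before freeing an overflow page, re-initialize it as an empty page
 * instead of zeroing it, so that WAL replay - which reads the page in
 * RBM_NORMAL mode - never encounters an all-zero page.
 */
Page		ovflpage = BufferGetPage(ovflbuf);

/* previously: MemSet(ovflpage, 0, BufferGetPageSize(ovflbuf)); */
_hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf));
MarkBufferDirty(ovflbuf);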
Attachment
On Wed, Sep 7, 2016 at 3:28 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > > Okay, I have fixed this issue as explained above. Apart from that, I > have fixed another issue reported by Mark Kirkwood upthread and few > other issues found during internal testing by Ashutosh Sharma. > Forgot to mention that you need to apply the latest concurrent hash index patch [1] before applying this patch. [1] - https://www.postgresql.org/message-id/CAA4eK1J6b8O4PcEPqRxNYbLVbfToNMJEEm%2Bqn0jZX31-obXrJw%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
> Thanks to Ashutosh Sharma for doing the testing of the patch and
> helping me in analyzing some of the above issues.

Hi All,

I would like to summarize the test-cases that I have executed for validating WAL logging in the hash index feature.

1) I have mainly run the pgbench test with a read-write workload at a scale factor of 1000 and various client counts like 16, 64 and 128, for time durations of 30 mins, 1 hr and 24 hrs. I have executed this test on a highly configured power2 machine with 128 cores and 512GB of RAM. I ran the test-case both with and without the replication setup.

Please note that I have changed the schema of the pgbench tables created during the initialisation phase. The new schema of the pgbench tables looks as shown below on both master and standby:

postgres=# \d pgbench_accounts
   Table "public.pgbench_accounts"
  Column  |     Type      | Modifiers
----------+---------------+-----------
 aid      | integer       | not null
 bid      | integer       |
 abalance | integer       |
 filler   | character(84) |
Indexes:
    "pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
    "pgbench_accounts_bid" hash (bid)

postgres=# \d pgbench_history
    Table "public.pgbench_history"
 Column |            Type             | Modifiers
--------+-----------------------------+-----------
 tid    | integer                     |
 bid    | integer                     |
 aid    | integer                     |
 delta  | integer                     |
 mtime  | timestamp without time zone |
 filler | character(22)               |
Indexes:
    "pgbench_history_bid" hash (bid)

2) I have also made use of the following tools for analyzing the hash index tables created on master and standby.

a) pgstattuple : Using this tool, I could verify the tuple-level statistics, which include the index's physical length, dead tuple percentage and other info, on both master and standby. However, I saw the below error message when using this tool for one of the hash index tables on both the master and the standby server.

postgres=# SELECT * FROM pgstattuple('pgbench_history_bid');
ERROR:  index "pgbench_history_bid" contains unexpected zero page at block 2482297

To investigate this, I made use of the pageinspect contrib module and came to know that the above block contains an uninitialised page. But this is quite possible in a hash index, as the bucket split happens in powers of 2 and we may find some uninitialised bucket pages.

b) pg_filedump : Using this tool, I could take a dump of all the hash index tables on master and standby and compare the dump files to verify whether there is any difference between the hash index pages on master and standby.

In short, this is all I did to verify the patch for WAL logging in hash indexes. I would like to thank Amit and Robert for their valuable inputs during the hash index testing phase. I am also planning to verify WAL logging in hash indexes using Kuntal's patch for WAL consistency check once it is stabilized.

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
On 07/09/16 21:58, Amit Kapila wrote: > On Wed, Aug 24, 2016 at 10:32 PM, Jeff Janes <jeff.janes@gmail.com> wrote: >> On Tue, Aug 23, 2016 at 10:05 PM, Amit Kapila <amit.kapila16@gmail.com> >> wrote: >>> On Wed, Aug 24, 2016 at 2:37 AM, Jeff Janes <jeff.janes@gmail.com> wrote: >>> >>>> After an intentionally created crash, I get an Assert triggering: >>>> >>>> TRAP: FailedAssertion("!(((freep)[(bitmapbit)/32] & >>>> (1<<((bitmapbit)%32))))", File: "hashovfl.c", Line: 553) >>>> >>>> freep[0] is zero and bitmapbit is 16. >>>> >>> Here what is happening is that when it tries to clear the bitmapbit, >>> it expects it to be set. Now, I think the reason for why it didn't >>> find the bit as set could be that after the new overflow page is added >>> and the bit corresponding to it is set, you might have crashed the >>> system and the replay would not have set the bit. Then while freeing >>> the overflow page it can hit the Assert as mentioned by you. I think >>> the problem here could be that I am using REGBUF_STANDARD to log the >>> bitmap page updates which seems to be causing the issue. As bitmap >>> page doesn't follow the standard page layout, it would have omitted >>> the actual contents while taking full page image and then during >>> replay, it would not have set the bit, because page doesn't need REDO. >>> I think here the fix is to use REGBUF_NO_IMAGE as we use for vm >>> buffers. >>> >>> If you can send me the detailed steps for how you have produced the >>> problem, then I can verify after fixing whether you are seeing the >>> same problem or something else. >> >> >> The test is rather awkward, it might be easier to just have me test it. >> > Okay, I have fixed this issue as explained above. Apart from that, I > have fixed another issue reported by Mark Kirkwood upthread and few > other issues found during internal testing by Ashutosh Sharma. > > The locking issue reported by Mark and Ashutosh is that the patch > didn't maintain the locking order while adding overflow page as it > maintains in other write operations (lock the bucket pages first and > then metapage to perform the write operation). I have added the > comments in _hash_addovflpage() to explain the locking order used in > modified patch. > > During stress testing with pgbench using master-standby setup, we > found an issue which indicates that WAL replay machinery doesn't > expect completely zeroed pages (See explanation of RBM_NORMAL mode > atop XLogReadBufferExtended). Previously before freeing the overflow > page we were zeroing it, now I have changed it to just initialize the > page such that the page will be empty. > > Apart from above, I have added support for old snapshot threshold in > the hash index code. > > Thanks to Ashutosh Sharma for doing the testing of the patch and > helping me in analyzing some of the above issues. > > I forgot to mention in my initial mail that Robert and I had some > off-list discussions about the design of this patch, many thanks to > him for providing inputs. > > Repeating my tests with these new patches applied points to the hang issue being solved. I tested several 10 minute runs (any of which was enough to elicit the hang previously). I'll do some longer ones, but looks good! regards Mark
On Thu, Sep 8, 2016 at 10:02 AM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote: > > Repeating my tests with these new patches applied points to the hang issue > being solved. I tested several 10 minute runs (any of which was enough to > elicit the hang previously). I'll do some longer ones, but looks good! > Thanks for verifying the issue and doing further testing of the patch. It is really helpful. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Sep 7, 2016 at 3:29 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>> Thanks to Ashutosh Sharma for doing the testing of the patch and
>> helping me in analyzing some of the above issues.
>
> Hi All,
>
> I would like to summarize the test-cases that i have executed for
> validating WAL logging in hash index feature.
>
> 1) I have mainly ran the pgbench test with read-write workload at the
> scale factor of 1000 and various client counts like 16, 64 and 128 for
> time duration of 30 mins, 1 hr and 24 hrs. I have executed this test
> on highly configured power2 machine with 128 cores and 512GB of RAM. I
> ran the test-case both with and without the replication setup.
>
> Please note that i have changed the schema of pgbench tables created
> during initialisation phase.
>
> The new schema of pgbench tables looks as shown below on both master
> and standby:
>
> postgres=# \d pgbench_accounts
> Table "public.pgbench_accounts"
> Column | Type | Modifiers
> ----------+---------------+-----------
> aid | integer | not null
> bid | integer |
> abalance | integer |
> filler | character(84) |
> Indexes:
> "pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
> "pgbench_accounts_bid" hash (bid)
>
> postgres=# \d pgbench_history
> Table "public.pgbench_history"
> Column | Type | Modifiers
> --------+-----------------------------+-----------
> tid | integer |
> bid | integer |
> aid | integer |
> delta | integer |
> mtime | timestamp without time zone |
> filler | character(22) |
> Indexes:
> "pgbench_history_bid" hash (bid)
Hi Ashutosh,
This schema will test the maintenance of hash indexes, but it will never use hash indexes for searching, so it limits the amount of test coverage you will get. While searching shouldn't generate novel types of WAL records (that I know of), it will generate locking and timing issues that might uncover bugs (if there are any left to uncover, of course).
I would drop the primary key on pgbench_accounts and replace it with a hash index and test it that way (except I don't have a 128 core machine at my disposal, so really I am suggesting that you do this...)
The lack of primary key and the non-uniqueness of the hash index should not be an operational problem, because the built in pgbench runs never attempt to violate the constraints anyway.
In fact, I'd replace all of the indexes on the rest of the pgbench tables with hash indexes, too, just for additional testing.
I plan to do testing using my own testing harness after changing it to insert a lot of dummy tuples (ones with negative values in the pseudo-pk column, which are never queried by the core part of the harness) and deleting them at random intervals. I think that none of pgbench's built in tests are likely to give the bucket splitting and squeezing code very much exercise.
Is there a way to gather statistics on how many of each type of WAL record are actually getting sent over the replication link? The only way I can think of is to turn on wal archiving as well as replication, then use pg_xlogdump to gather the stats.
I've run my original test for a while now and have not seen any problems. But I realized I forgot to compile with --enable-cassert, so I will have to redo it to make sure the assertion failures have been fixed. In my original testing I did very rarely get a deadlock (or some kind of hang), and I haven't seen that again so far. It was probably the same source as the one Mark observed, and so the same fix.
Cheers,
Jeff
On 09/09/16 07:09, Jeff Janes wrote: > On Wed, Sep 7, 2016 at 3:29 AM, Ashutosh Sharma <ashu.coek88@gmail.com > <mailto:ashu.coek88@gmail.com>> wrote: > > > Thanks to Ashutosh Sharma for doing the testing of the patch and > > helping me in analyzing some of the above issues. > > Hi All, > > I would like to summarize the test-cases that i have executed for > validating WAL logging in hash index feature. > > 1) I have mainly ran the pgbench test with read-write workload at the > scale factor of 1000 and various client counts like 16, 64 and 128 for > time duration of 30 mins, 1 hr and 24 hrs. I have executed this test > on highly configured power2 machine with 128 cores and 512GB of RAM. I > ran the test-case both with and without the replication setup. > > Please note that i have changed the schema of pgbench tables created > during initialisation phase. > > The new schema of pgbench tables looks as shown below on both master > and standby: > > postgres=# \d pgbench_accounts > Table "public.pgbench_accounts" > Column | Type | Modifiers > ----------+---------------+----------- > aid | integer | not null > bid | integer | > abalance | integer | > filler | character(84) | > Indexes: > "pgbench_accounts_pkey" PRIMARY KEY, btree (aid) > "pgbench_accounts_bid" hash (bid) > > postgres=# \d pgbench_history > Table "public.pgbench_history" > Column | Type | Modifiers > --------+-----------------------------+----------- > tid | integer | > bid | integer | > aid | integer | > delta | integer | > mtime | timestamp without time zone | > filler | character(22) | > Indexes: > "pgbench_history_bid" hash (bid) > > > Hi Ashutosh, > > This schema will test the maintenance of hash indexes, but it will > never use hash indexes for searching, so it limits the amount of test > coverage you will get. While searching shouldn't generate novel types > of WAL records (that I know of), it will generate locking and timing > issues that might uncover bugs (if there are any left to uncover, of > course). > > I would drop the primary key on pgbench_accounts and replace it with a > hash index and test it that way (except I don't have a 128 core > machine at my disposal, so really I am suggesting that you do this...) > > The lack of primary key and the non-uniqueness of the hash index > should not be an operational problem, because the built in pgbench > runs never attempt to violate the constraints anyway. > > In fact, I'd replace all of the indexes on the rest of the pgbench > tables with hash indexes, too, just for additional testing. > > I plan to do testing using my own testing harness after changing it to > insert a lot of dummy tuples (ones with negative values in the > pseudo-pk column, which are never queried by the core part of the > harness) and deleting them at random intervals. I think that none of > pgbench's built in tests are likely to give the bucket splitting and > squeezing code very much exercise. > > Is there a way to gather statistics on how many of each type of WAL > record are actually getting sent over the replication link? The only > way I can think of is to turn on wal archving as well as replication, > then using pg_xlogdump to gather the stats. > > I've run my original test for a while now and have not seen any > problems. But I realized I forgot to compile with enable-casserts, to > I will have to redo it to make sure the assertion failures have been > fixed. In my original testing I did very rarely get a deadlock (or > some kind of hang), and I haven't seen that again so far. 
> It was probably the same source as the one Mark observed, and so the same fix.
>
> Cheers,
>
> Jeff

Yeah, good suggestion about replacing (essentially) all the indexes with hash ones and testing. I did some short runs with this type of schema yesterday (actually to get a feel for whether hash performance vs btree was comparable - it does seem to be), but probably longer ones with higher concurrency (as high as I can manage on a single socket i7 anyway) are a good plan. If Ashutosh has access to seriously large numbers of cores then that is even better :-)

Cheers

Mark
On Fri, Sep 9, 2016 at 12:39 AM, Jeff Janes <jeff.janes@gmail.com> wrote: > > I plan to do testing using my own testing harness after changing it to > insert a lot of dummy tuples (ones with negative values in the pseudo-pk > column, which are never queried by the core part of the harness) and > deleting them at random intervals. I think that none of pgbench's built in > tests are likely to give the bucket splitting and squeezing code very much > exercise. > Hash index tests [1] written by Mithun does cover part of that code and we have done testing by extending those tests to cover splitting and squeezing part of code. > Is there a way to gather statistics on how many of each type of WAL record > are actually getting sent over the replication link? The only way I can > think of is to turn on wal archving as well as replication, then using > pg_xlogdump to gather the stats. > Sounds sensible, but what do you want to know by getting the number of each type of WAL records? I understand it is important to cover all the WAL records for hash index (and I think Ashutosh has done that during his tests [2]), but may be sending multiple times same record could further strengthen the validation. > I've run my original test for a while now and have not seen any problems. > But I realized I forgot to compile with enable-casserts, to I will have to > redo it to make sure the assertion failures have been fixed. In my original > testing I did very rarely get a deadlock (or some kind of hang), and I > haven't seen that again so far. It was probably the same source as the one > Mark observed, and so the same fix. > Thanks for the verification. [1] - https://commitfest.postgresql.org/10/716/ [2] - https://www.postgresql.org/message-id/CAE9k0PkPumi4iWFuD%2BjHHkpcxn531%3DDJ8uH0dctsvF%2BdaZY6yQ%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 09/09/16 14:50, Mark Kirkwood wrote: > > Yeah, good suggestion about replacing (essentially) all the indexes > with hash ones and testing. I did some short runs with this type of > schema yesterday (actually to get a feel for if hash performance vs > btree was compareable - does seem tp be) - but probably longer ones > with higher concurrency (as high as I can manage on a single socket i7 > anyway) is a good plan. If Ashutosh has access to seriously large > numbers of cores then that is even better :-) > > I managed to find a slightly bigger server (used our openstack cloud to run a 16 cpu vm). With the schema modified as follows: bench=# \d pgbench_accounts Table "public.pgbench_accounts" Column | Type | Modifiers ----------+---------------+----------- aid | integer | not null bid | integer | abalance | integer | filler | character(84) | Indexes: "pgbench_accounts_pkey" hash (aid) bench=# \d pgbench_branches Table "public.pgbench_branches" Column | Type | Modifiers ----------+---------------+----------- bid | integer | not null bbalance | integer | filler | character(88)| Indexes: "pgbench_branches_pkey" hash (bid) bench=# \d pgbench_tellers Table "public.pgbench_tellers" Column | Type | Modifiers ----------+---------------+----------- tid | integer | not null bid | integer | tbalance | integer | filler | character(84) | Indexes: "pgbench_tellers_pkey" hash (tid) bench=# \d pgbench_history Table "public.pgbench_history" Column | Type | Modifiers --------+-----------------------------+----------- tid | integer | bid | integer | aid | integer | delta | integer | mtime | timestamp without time zone| filler | character(22) | Indexes: "pgbench_history_pkey" hash (bid) performed several 10 hour runs on size 100 database using 32 and 64 clients. For the last run I rebuilt with assertions enabled. No hangs or assertion failures. regards Mark
On Sun, Sep 11, 2016 at 4:10 AM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote: > > > performed several 10 hour runs on size 100 database using 32 and 64 clients. > For the last run I rebuilt with assertions enabled. No hangs or assertion > failures. > Thanks for verification. Do you think we can do some read-only workload benchmarking using this server? If yes, then probably you can use concurrent hash index patch [1] and cache the metapage patch [2] (I think Mithun needs to rebase his patch) to do so. [1] - https://www.postgresql.org/message-id/CAA4eK1J6b8O4PcEPqRxNYbLVbfToNMJEEm+qn0jZX31-obXrJw@mail.gmail.com [2] - https://www.postgresql.org/message-id/CAD__OuhJ29CeBif_fLGe4t9Vj_-cFXBwCXhjO+D_16TXbemY+g@mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 11/09/16 17:01, Amit Kapila wrote: > On Sun, Sep 11, 2016 at 4:10 AM, Mark Kirkwood > <mark.kirkwood@catalyst.net.nz> wrote: >> >> performed several 10 hour runs on size 100 database using 32 and 64 clients. >> For the last run I rebuilt with assertions enabled. No hangs or assertion >> failures. >> > Thanks for verification. Do you think we can do some read-only > workload benchmarking using this server? If yes, then probably you > can use concurrent hash index patch [1] and cache the metapage patch > [2] (I think Mithun needs to rebase his patch) to do so. > > > > [1] - https://www.postgresql.org/message-id/CAA4eK1J6b8O4PcEPqRxNYbLVbfToNMJEEm+qn0jZX31-obXrJw@mail.gmail.com > [2] - https://www.postgresql.org/message-id/CAD__OuhJ29CeBif_fLGe4t9Vj_-cFXBwCXhjO+D_16TXbemY+g@mail.gmail.com > > I can do - are we checking for hangs/assertions or comparing patched vs unpatched performance (for the metapage patch)? regards Mark
On 11/09/16 19:16, Mark Kirkwood wrote: > > > On 11/09/16 17:01, Amit Kapila wrote: >> ...Do you think we can do some read-only >> workload benchmarking using this server? If yes, then probably you >> can use concurrent hash index patch [1] and cache the metapage patch >> [2] (I think Mithun needs to rebase his patch) to do so. >> >> >> >> [1] - >> https://www.postgresql.org/message-id/CAA4eK1J6b8O4PcEPqRxNYbLVbfToNMJEEm+qn0jZX31-obXrJw@mail.gmail.com >> [2] - >> https://www.postgresql.org/message-id/CAD__OuhJ29CeBif_fLGe4t9Vj_-cFXBwCXhjO+D_16TXbemY+g@mail.gmail.com >> >> > > I can do - are we checking for hangs/assertions or comparing > patched vs unpatched performance (for the metapage patch)? > > So, assuming the latter - testing performance with and without the metapage patch: For my 1st runs: - cpus 16, ram 16G - size 100, clients 32 I'm seeing no difference in performance for a read only (-S) pgbench workload (with everybody using hash indexes). I guess not that surprising as the db fits in RAM (1.6G and we have 16G). So I'll retry with a bigger dataset (suspect size 2000 is needed). regards Mark
On Thu, Sep 8, 2016 at 12:09 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
I plan to do testing using my own testing harness after changing it to insert a lot of dummy tuples (ones with negative values in the pseudo-pk column, which are never queried by the core part of the harness) and deleting them at random intervals. I think that none of pgbench's built in tests are likely to give the bucket splitting and squeezing code very much exercise.
I've implemented this by adding lines 197 through 202 to the count.pl script. (I'm reattaching the test case.)
Within a few minutes of testing, I start getting Errors like these:
29236 UPDATE XX000 2016-09-11 17:21:25.893 PDT:ERROR: buffer 2762 is not owned by resource owner Portal
29236 UPDATE XX000 2016-09-11 17:21:25.893 PDT:STATEMENT: update foo set count=count+1 where index=$1
In one test, I also got an error from my test harness itself indicating tuples are transiently missing from the index, starting an hour into a test:
child abnormal exit update did not update 1 row: key 9555 updated 0E0 at count.pl line 194.\n at count.pl line 208.
Those key values should always find exactly one row to update.
If the tuples were permanently missing from the index, I would keep getting errors on the same key values very frequently. But I don't get that; the errors remain infrequent and are on a different value each time, so I think the tuples are in the index but the scan somehow misses them, either while the bucket is being split or while it is being squeezed.
This is on a build without --enable-cassert.
Any ideas on how best to go about investigating this?
Cheers,
Jeff
Attachment
On Mon, Sep 12, 2016 at 7:00 AM, Jeff Janes <jeff.janes@gmail.com> wrote: > On Thu, Sep 8, 2016 at 12:09 PM, Jeff Janes <jeff.janes@gmail.com> wrote: > >> >> I plan to do testing using my own testing harness after changing it to >> insert a lot of dummy tuples (ones with negative values in the pseudo-pk >> column, which are never queried by the core part of the harness) and >> deleting them at random intervals. I think that none of pgbench's built in >> tests are likely to give the bucket splitting and squeezing code very much >> exercise. > > > > I've implemented this, by adding lines 197 through 202 to the count.pl > script. (I'm reattaching the test case) > > Within a few minutes of testing, I start getting Errors like these: > > 29236 UPDATE XX000 2016-09-11 17:21:25.893 PDT:ERROR: buffer 2762 is not > owned by resource owner Portal > 29236 UPDATE XX000 2016-09-11 17:21:25.893 PDT:STATEMENT: update foo set > count=count+1 where index=$1 > > > In one test, I also got an error from my test harness itself indicating > tuples are transiently missing from the index, starting an hour into a test: > > child abnormal exit update did not update 1 row: key 9555 updated 0E0 at > count.pl line 194.\n at count.pl line 208. > child abnormal exit update did not update 1 row: key 8870 updated 0E0 at > count.pl line 194.\n at count.pl line 208. > child abnormal exit update did not update 1 row: key 8453 updated 0E0 at > count.pl line 194.\n at count.pl line 208. > > Those key values should always find exactly one row to update. > > If the tuples were permanently missing from the index, I would keep getting > errors on the same key values very frequently. But I don't get that, the > errors remain infrequent and are on different value each time, so I think > the tuples are in the index but the scan somehow misses them, either while > the bucket is being split or while it is being squeezed. > > This on a build without enable-asserts. > > Any ideas on how best to go about investigating this? > I think these symptoms indicate the bug in concurrent hash index patch, but it could be that the problem can be only revealed with WAL patch. Is it possible to just try this with concurrent hash index patch? In any case, thanks for testing it, I will look into these issues. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Sun, Sep 11, 2016 at 3:01 PM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote: > On 11/09/16 19:16, Mark Kirkwood wrote: > >> >> >> On 11/09/16 17:01, Amit Kapila wrote: >>> >>> ...Do you think we can do some read-only >>> workload benchmarking using this server? If yes, then probably you >>> can use concurrent hash index patch [1] and cache the metapage patch >>> [2] (I think Mithun needs to rebase his patch) to do so. >>> >>> >>> >>> [1] - >>> https://www.postgresql.org/message-id/CAA4eK1J6b8O4PcEPqRxNYbLVbfToNMJEEm+qn0jZX31-obXrJw@mail.gmail.com >>> [2] - >>> https://www.postgresql.org/message-id/CAD__OuhJ29CeBif_fLGe4t9Vj_-cFXBwCXhjO+D_16TXbemY+g@mail.gmail.com >>> >>> >> >> I can do - are we checking checking for hangs/assertions or comparing >> patched vs unpatched performance (for the metapage patch)? >> >> The first step is to find any bugs and then do performance testing. What I wanted for performance testing was to compare HEAD against all three patches (all three together) of Hash Index (concurrent hash index, cache the meta page, WAL for hash index). For Head, we want two set of numbers, one with hash indexes and another with btree indexes. As Jeff has found few problems, I think it is better to first fix those before going for performance tests. > > So, assuming the latter - testing performance with and without the metapage > patch: > > For my 1st runs: > > - cpus 16, ran 16G > - size 100, clients 32 > > I'm seeing no difference in performance for read only (-S) pgbench workload > (with everybody using has indexes). I guess not that surprising as the db > fites in ram (1.6G and we have 16G). So I'll retry with a bigger dataset > (suspect size 2000 is needed). > I think you should try with -S -M prepared and with various client counts (8,16,32,64,100...). -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Sun, Sep 11, 2016 at 7:40 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Sep 12, 2016 at 7:00 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Thu, Sep 8, 2016 at 12:09 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>
>>
>> I plan to do testing using my own testing harness after changing it to
>> insert a lot of dummy tuples (ones with negative values in the pseudo-pk
>> column, which are never queried by the core part of the harness) and
>> deleting them at random intervals. I think that none of pgbench's built in
>> tests are likely to give the bucket splitting and squeezing code very much
>> exercise.
>
>
>
> I've implemented this, by adding lines 197 through 202 to the count.pl
> script. (I'm reattaching the test case)
>
> Within a few minutes of testing, I start getting Errors like these:
>
> 29236 UPDATE XX000 2016-09-11 17:21:25.893 PDT:ERROR: buffer 2762 is not
> owned by resource owner Portal
> 29236 UPDATE XX000 2016-09-11 17:21:25.893 PDT:STATEMENT: update foo set
> count=count+1 where index=$1
>
>
> In one test, I also got an error from my test harness itself indicating
> tuples are transiently missing from the index, starting an hour into a test:
>
> child abnormal exit update did not update 1 row: key 9555 updated 0E0 at
> count.pl line 194.\n at count.pl line 208.
> child abnormal exit update did not update 1 row: key 8870 updated 0E0 at
> count.pl line 194.\n at count.pl line 208.
> child abnormal exit update did not update 1 row: key 8453 updated 0E0 at
> count.pl line 194.\n at count.pl line 208.
>
> Those key values should always find exactly one row to update.
>
> If the tuples were permanently missing from the index, I would keep getting
> errors on the same key values very frequently. But I don't get that, the
> errors remain infrequent and are on different value each time, so I think
> the tuples are in the index but the scan somehow misses them, either while
> the bucket is being split or while it is being squeezed.
>
> This on a build without enable-asserts.
>
> Any ideas on how best to go about investigating this?
>
I think these symptoms indicate the bug in concurrent hash index patch, but it could be that the problem can be only revealed with WAL
patch. Is it possible to just try this with concurrent hash index
patch? In any case, thanks for testing it, I will look into these
issues.
My test program (as posted) injects crashes and then checks the post-crash-recovery system for consistency, so it cannot be run as-is without the WAL patch. I also ran the test with crashing turned off (just change the JJ* variables at the top of the do.sh to all be set to the empty string), and in that case I didn't see either problem, but it could just be that I didn't run it long enough.
It should have been long enough to detect the rather common "buffer <x> is not owned by resource owner Portal" problem, so that one I think is specific to the WAL patch (probably the part which tries to complete bucket splits when it detects one was started but not completed?)
Cheers,
Jeff
Hi, On 09/07/2016 05:58 AM, Amit Kapila wrote: > Okay, I have fixed this issue as explained above. Apart from that, I > have fixed another issue reported by Mark Kirkwood upthread and few > other issues found during internal testing by Ashutosh Sharma. > > The locking issue reported by Mark and Ashutosh is that the patch > didn't maintain the locking order while adding overflow page as it > maintains in other write operations (lock the bucket pages first and > then metapage to perform the write operation). I have added the > comments in _hash_addovflpage() to explain the locking order used in > modified patch. > > During stress testing with pgbench using master-standby setup, we > found an issue which indicates that WAL replay machinery doesn't > expect completely zeroed pages (See explanation of RBM_NORMAL mode > atop XLogReadBufferExtended). Previously before freeing the overflow > page we were zeroing it, now I have changed it to just initialize the > page such that the page will be empty. > > Apart from above, I have added support for old snapshot threshold in > the hash index code. > > Thanks to Ashutosh Sharma for doing the testing of the patch and > helping me in analyzing some of the above issues. > > I forgot to mention in my initial mail that Robert and I had some > off-list discussions about the design of this patch, many thanks to > him for providing inputs. > Some initial feedback. README: +in_complete split flag. The reader algorithm works correctly, as it will scan What flag ? hashxlog.c: hash_xlog_move_page_contents hash_xlog_squeeze_page Both have "bukcetbuf" (-> "bucketbuf"), and + if (BufferIsValid(bukcetbuf)); -> + if (BufferIsValid(bucketbuf)) and indent following line: LockBufferForCleanup(bukcetbuf); hash_xlog_delete has the "if" issue too. hash.h: Move the XLog related content to hash_xlog.h Best regards, Jesper
On Mon, Sep 12, 2016 at 11:29 AM, Jeff Janes <jeff.janes@gmail.com> wrote: > > > My test program (as posted) injects crashes and then checks the > post-crash-recovery system for consistency, so it cannot be run as-is > without the WAL patch. I also ran the test with crashing turned off (just > change the JJ* variables at the top of the do.sh to all be set to the empty > string), and in that case I didn't see either problem, but it could just > be that I didn't run it long enough. > > It should have been long enough to detect the rather common "buffer <x> is > not owned by resource owner Portal" problem, so that one I think is specific > to the WAL patch (probably the part which tries to complete bucket splits > when it detects one was started but not completed?) > Both problems seem to be due to the same reason, and it lies in the concurrent hash index patch. What is happening here is that we are trying to unpin the old bucket buffer twice. We need to scan the old bucket when there is an incomplete split, and the issue is that the system crashed during a split; after restart, during the old bucket scan, it tries to unpin the old primary bucket buffer twice when the new bucket has additional overflow pages. I will post the updated patch on the concurrent hash index thread. Once I post the patch, it would be good if you try to reproduce the issue once more. Thanks to Ashutosh Sharma, who shared the call stack off-list after reproducing the problem. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
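As a minimal sketch of the guard being described (the variable names here are illustrative only, not the actual patch code), the idea is to forget the old primary bucket buffer as soon as it has been unpinned so that no later cleanup path can unpin it again:

    /*
     * Illustrative sketch: old_bucket_buf holds the pin on the old primary
     * bucket page taken for scanning an incomplete split.  Clearing the
     * reference after the first release prevents a second unpin.
     */
    if (BufferIsValid(old_bucket_buf))
    {
        _hash_dropbuf(rel, old_bucket_buf);   /* release only the pin */
        old_bucket_buf = InvalidBuffer;
    }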
On Mon, Sep 12, 2016 at 10:24 PM, Jesper Pedersen <jesper.pedersen@redhat.com> wrote: > > Some initial feedback. > Thanks for looking into patch. > README: > +in_complete split flag. The reader algorithm works correctly, as it will > scan > > What flag ? > in-complete-split flag which indicates that split has to be finished for that particular bucket. The value of these flags are LH_BUCKET_NEW_PAGE_SPLIT and LH_BUCKET_OLD_PAGE_SPLIT for new and old bucket respectively. It is set in hasho_flag in special area of page. I have slightly expanded the definition in README, but not sure if it is good idea to mention flags defined in hash.h. Let me know, if still it is unclear or you want some thing additional to be added in README. > hashxlog.c: > > hash_xlog_move_page_contents > hash_xlog_squeeze_page > > Both have "bukcetbuf" (-> "bucketbuf"), and > > + if (BufferIsValid(bukcetbuf)); > > -> > > + if (BufferIsValid(bucketbuf)) > > and indent following line: > > LockBufferForCleanup(bukcetbuf); > > hash_xlog_delete > > has the "if" issue too. > Fixed all the above cosmetic issues. > hash.h: > > Move the XLog related content to hash_xlog.h > Moved and renamed hashxlog.c to hash_xlog.c to match the name with corresponding .h file. Note - This patch is to be applied on top of concurrent hash index v6 version [1]. Jeff, use this version of patch for further review and test. [1] - https://www.postgresql.org/message-id/CAA4eK1%2BERbP%2B7mdKkAhJZWQ_dTdkocbpt7LSWFwCQvUHBXzkmA%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
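A rough sketch of how the in-complete-split flag described in the mail above is consulted (the flag and field names follow that description; the real definitions are in the patch's hash.h):

    /*
     * The split-in-progress state lives in hasho_flag in the special area
     * of the bucket page.
     */
    HashPageOpaque opaque = (HashPageOpaque) PageGetSpecialPointer(page);

    if (opaque->hasho_flag & LH_BUCKET_OLD_PAGE_SPLIT)
    {
        /*
         * The old bucket has an unfinished split, so a scan must also
         * visit the corresponding new bucket to see all the tuples.
         */
    }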
On 09/13/2016 07:41 AM, Amit Kapila wrote: >> README: >> +in_complete split flag. The reader algorithm works correctly, as it will >> scan >> >> What flag ? >> > > in-complete-split flag which indicates that split has to be finished > for that particular bucket. The value of these flags are > LH_BUCKET_NEW_PAGE_SPLIT and LH_BUCKET_OLD_PAGE_SPLIT for new and old > bucket respectively. It is set in hasho_flag in special area of page. > I have slightly expanded the definition in README, but not sure if it > is good idea to mention flags defined in hash.h. Let me know, if still > it is unclear or you want some thing additional to be added in README. > I think it is better now. >> hashxlog.c: >> >> hash_xlog_move_page_contents >> hash_xlog_squeeze_page >> >> Both have "bukcetbuf" (-> "bucketbuf"), and >> >> + if (BufferIsValid(bukcetbuf)); >> >> -> >> >> + if (BufferIsValid(bucketbuf)) >> >> and indent following line: >> >> LockBufferForCleanup(bukcetbuf); >> >> hash_xlog_delete >> >> has the "if" issue too. >> > > Fixed all the above cosmetic issues. > I meant there is an extra ';' on the "if" statements: + if (BufferIsValid(bukcetbuf)); <-- in hash_xlog_move_page_contents, hash_xlog_squeeze_page and hash_xlog_delete. Best regards, Jesper
Hi All, I am getting following error when running the test script shared by Jeff -[1] . The error is observed upon executing the test script for around 3-4 hrs. 57869 INSERT XX000 2016-09-14 07:58:01.211 IST:ERROR: lock buffer_content 1 is not held 57869 INSERT XX000 2016-09-14 07:58:01.211 IST:STATEMENT: insert into foo (index) select $1 from generate_series(1,10000) 124388 INSERT XX000 2016-09-14 11:24:13.593 IST:ERROR: lock buffer_content 10 is not held 124388 INSERT XX000 2016-09-14 11:24:13.593 IST:STATEMENT: insert into foo (index) select $1 from generate_series(1,10000) 124381 INSERT XX000 2016-09-14 11:24:13.594 IST:ERROR: lock buffer_content 10 is not held 124381 INSERT XX000 2016-09-14 11:24:13.594 IST:STATEMENT: insert into foo (index) select $1 from generate_series(1,10000) [1]- https://www.postgresql.org/message-id/CAMkU%3D1xRt8jBBB7g_7K41W00%3Dbr9UrxMVn_rhWhKPLaHfEdM5A%40mail.gmail.com Please note that i am performing the test on latest patch for concurrent hash index and WAL log in hash index shared by Amit yesterday. With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com On Wed, Sep 14, 2016 at 12:04 AM, Jesper Pedersen <jesper.pedersen@redhat.com> wrote: > On 09/13/2016 07:41 AM, Amit Kapila wrote: >>> >>> README: >>> +in_complete split flag. The reader algorithm works correctly, as it >>> will >>> scan >>> >>> What flag ? >>> >> >> in-complete-split flag which indicates that split has to be finished >> for that particular bucket. The value of these flags are >> LH_BUCKET_NEW_PAGE_SPLIT and LH_BUCKET_OLD_PAGE_SPLIT for new and old >> bucket respectively. It is set in hasho_flag in special area of page. >> I have slightly expanded the definition in README, but not sure if it >> is good idea to mention flags defined in hash.h. Let me know, if still >> it is unclear or you want some thing additional to be added in README. >> > > I think it is better now. > >>> hashxlog.c: >>> >>> hash_xlog_move_page_contents >>> hash_xlog_squeeze_page >>> >>> Both have "bukcetbuf" (-> "bucketbuf"), and >>> >>> + if (BufferIsValid(bukcetbuf)); >>> >>> -> >>> >>> + if (BufferIsValid(bucketbuf)) >>> >>> and indent following line: >>> >>> LockBufferForCleanup(bukcetbuf); >>> >>> hash_xlog_delete >>> >>> has the "if" issue too. >>> >> >> Fixed all the above cosmetic issues. >> > > I meant there is an extra ';' on the "if" statements: > > + if (BufferIsValid(bukcetbuf)); <-- > > in hash_xlog_move_page_contents, hash_xlog_squeeze_page and > hash_xlog_delete. > > > Best regards, > Jesper > > > > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers
Hi All, Below is the backtrace for the issue reported in my earlier mail [1]. From the callstack it looks like we are trying to release lock on a meta page twice in _hash_expandtable(). (gdb) bt #0 0x00000000007b01cf in LWLockRelease (lock=0x7f55f59d0570) at lwlock.c:1799 #1 0x000000000078990c in LockBuffer (buffer=2, mode=0) at bufmgr.c:3540 #2 0x00000000004a5d3c in _hash_chgbufaccess (rel=0x7f5605b608c0, buf=2, from_access=1, to_access=-1) at hashpage.c:331 #3 0x00000000004a722d in _hash_expandtable (rel=0x7f5605b608c0, metabuf=2) at hashpage.c:995 #4 0x00000000004a316a in _hash_doinsert (rel=0x7f5605b608c0, itup=0x2ba5030) at hashinsert.c:313 #5 0x00000000004a0d85 in hashinsert (rel=0x7f5605b608c0, values=0x7ffdf5409c70, isnull=0x7ffdf5409c50 "", ht_ctid=0x2c2e4f4, heapRel=0x7f5605b58250, checkUnique=UNIQUE_CHECK_NO) at hash.c:248 #6 0x00000000004c5a16 in index_insert (indexRelation=0x7f5605b608c0, values=0x7ffdf5409c70, isnull=0x7ffdf5409c50 "", heap_t_ctid=0x2c2e4f4, heapRelation=0x7f5605b58250, checkUnique=UNIQUE_CHECK_NO) at indexam.c:204 #7 0x000000000063f2d5 in ExecInsertIndexTuples (slot=0x2b9a2c0, tupleid=0x2c2e4f4, estate=0x2b99c00, noDupErr=0 '\000', specConflict=0x0, arbiterIndexes=0x0) at execIndexing.c:388 #8 0x000000000066a1fc in ExecInsert (mtstate=0x2b99e50, slot=0x2b9a2c0, planSlot=0x2b9a2c0, arbiterIndexes=0x0, onconflict=ONCONFLICT_NONE, estate=0x2b99c00, canSetTag=1 '\001') at nodeModifyTable.c:481 #9 0x000000000066b841 in ExecModifyTable (node=0x2b99e50) at nodeModifyTable.c:1496 #10 0x0000000000645e51 in ExecProcNode (node=0x2b99e50) at execProcnode.c:396 #11 0x00000000006424d9 in ExecutePlan (estate=0x2b99c00, planstate=0x2b99e50, use_parallel_mode=0 '\000', operation=CMD_INSERT, sendTuples=0 '\000', numberTuples=0, direction=ForwardScanDirection, dest=0xd73c20 <donothingDR>) at execMain.c:1567 #12 0x000000000064087a in standard_ExecutorRun (queryDesc=0x2b89c70, direction=ForwardScanDirection, count=0) at execMain.c:338 #13 0x00007f55fe590605 in pgss_ExecutorRun (queryDesc=0x2b89c70, direction=ForwardScanDirection, count=0) at pg_stat_statements.c:877 #14 0x0000000000640751 in ExecutorRun (queryDesc=0x2b89c70, direction=ForwardScanDirection, count=0) at execMain.c:284 #15 0x00000000007c331e in ProcessQuery (plan=0x2b873f0, sourceText=0x2b89bd0 "insert into foo (index) select $1 from generate_series(1,10000)", params=0x2b89c20, dest=0xd73c20 <donothingDR>, completionTag=0x7ffdf540a3f0 "") at pquery.c:187 #16 0x00000000007c4a0c in PortalRunMulti (portal=0x2ae7930, isTopLevel=1 '\001', setHoldSnapshot=0 '\000', dest=0xd73c20 <donothingDR>, altdest=0xd73c20 <donothingDR>, completionTag=0x7ffdf540a3f0 "") at pquery.c:1303 #17 0x00000000007c4055 in PortalRun (portal=0x2ae7930, count=9223372036854775807, isTopLevel=1 '\001', dest=0x2b5dc90, altdest=0x2b5dc90, completionTag=0x7ffdf540a3f0 "") at pquery.c:815 #18 0x00000000007bfb45 in exec_execute_message (portal_name=0x2b5d880 "", max_rows=9223372036854775807) at postgres.c:1977 #19 0x00000000007c25c7 in PostgresMain (argc=1, argv=0x2b05df8, dbname=0x2aca3c0 "postgres", username=0x2b05de0 "edb") at postgres.c:4133 #20 0x0000000000744d8f in BackendRun (port=0x2aefa60) at postmaster.c:4260 #21 0x0000000000744523 in BackendStartup (port=0x2aefa60) at postmaster.c:3934 #22 0x0000000000740f9c in ServerLoop () at postmaster.c:1691 #23 0x0000000000740623 in PostmasterMain (argc=4, argv=0x2ac81f0) at postmaster.c:1299 #24 0x00000000006940fe in main (argc=4, argv=0x2ac81f0) at main.c:228 Please let me know for any further inputs. 
[1]- https://www.postgresql.org/message-id/CAE9k0Pmxh-4NAr4GjzDDFHdBKDrKy2FV-Z%2B2Tp8vb2Kmxu%3D6zg%40mail.gmail.com With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com On Wed, Sep 14, 2016 at 2:45 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > Hi All, > > I am getting following error when running the test script shared by > Jeff -[1] . The error is observed upon executing the test script for > around 3-4 hrs. > > 57869 INSERT XX000 2016-09-14 07:58:01.211 IST:ERROR: lock > buffer_content 1 is not held > 57869 INSERT XX000 2016-09-14 07:58:01.211 IST:STATEMENT: insert into > foo (index) select $1 from generate_series(1,10000) > > 124388 INSERT XX000 2016-09-14 11:24:13.593 IST:ERROR: lock > buffer_content 10 is not held > 124388 INSERT XX000 2016-09-14 11:24:13.593 IST:STATEMENT: insert > into foo (index) select $1 from generate_series(1,10000) > > 124381 INSERT XX000 2016-09-14 11:24:13.594 IST:ERROR: lock > buffer_content 10 is not held > 124381 INSERT XX000 2016-09-14 11:24:13.594 IST:STATEMENT: insert > into foo (index) select $1 from generate_series(1,10000) > > [1]- https://www.postgresql.org/message-id/CAMkU%3D1xRt8jBBB7g_7K41W00%3Dbr9UrxMVn_rhWhKPLaHfEdM5A%40mail.gmail.com > > Please note that i am performing the test on latest patch for > concurrent hash index and WAL log in hash index shared by Amit > yesterday. > > > With Regards, > Ashutosh Sharma > EnterpriseDB: http://www.enterprisedb.com > > > > On Wed, Sep 14, 2016 at 12:04 AM, Jesper Pedersen > <jesper.pedersen@redhat.com> wrote: >> On 09/13/2016 07:41 AM, Amit Kapila wrote: >>>> >>>> README: >>>> +in_complete split flag. The reader algorithm works correctly, as it >>>> will >>>> scan >>>> >>>> What flag ? >>>> >>> >>> in-complete-split flag which indicates that split has to be finished >>> for that particular bucket. The value of these flags are >>> LH_BUCKET_NEW_PAGE_SPLIT and LH_BUCKET_OLD_PAGE_SPLIT for new and old >>> bucket respectively. It is set in hasho_flag in special area of page. >>> I have slightly expanded the definition in README, but not sure if it >>> is good idea to mention flags defined in hash.h. Let me know, if still >>> it is unclear or you want some thing additional to be added in README. >>> >> >> I think it is better now. >> >>>> hashxlog.c: >>>> >>>> hash_xlog_move_page_contents >>>> hash_xlog_squeeze_page >>>> >>>> Both have "bukcetbuf" (-> "bucketbuf"), and >>>> >>>> + if (BufferIsValid(bukcetbuf)); >>>> >>>> -> >>>> >>>> + if (BufferIsValid(bucketbuf)) >>>> >>>> and indent following line: >>>> >>>> LockBufferForCleanup(bukcetbuf); >>>> >>>> hash_xlog_delete >>>> >>>> has the "if" issue too. >>>> >>> >>> Fixed all the above cosmetic issues. >>> >> >> I meant there is an extra ';' on the "if" statements: >> >> + if (BufferIsValid(bukcetbuf)); <-- >> >> in hash_xlog_move_page_contents, hash_xlog_squeeze_page and >> hash_xlog_delete. >> >> >> Best regards, >> Jesper >> >> >> >> >> >> -- >> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) >> To make changes to your subscription: >> http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Sep 14, 2016 at 4:36 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > Hi All, > > Below is the backtrace for the issue reported in my earlier mail [1]. > From the callstack it looks like we are trying to release lock on a > meta page twice in _hash_expandtable(). > Thanks for the call stack. I think the below code in the patch is the culprit. Here we have already released the metapage lock, and then on failure we are trying to release it again. _hash_expandtable() { .. /* Release the metapage lock, before completing the split. */ _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK); .. if (!buf_nblkno) { _hash_relbuf(rel, buf_oblkno); goto fail; } .. fail: /* We didn't write the metapage, so just drop lock */ _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK); } This is a problem in the concurrent hash index patch. I will send the fix in the next version of the patch. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
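One possible shape of such a fix, sketched from the code quoted above (the extra label is illustrative and not necessarily what the updated patch will use): once the metapage lock has been dropped before completing the split, the failure path must skip the unlock under "fail:".

    /* Release the metapage lock, before completing the split. */
    _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
    ..
    if (!buf_nblkno)
    {
        _hash_relbuf(rel, buf_oblkno);
        goto fail_no_metalock;      /* metapage lock already dropped */
    }
    ..
fail:
    /* We didn't write the metapage, so just drop lock */
    _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
fail_no_metalock:
    ..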
On Wed, Sep 14, 2016 at 12:04 AM, Jesper Pedersen <jesper.pedersen@redhat.com> wrote: > On 09/13/2016 07:41 AM, Amit Kapila wrote: >>> > > I meant there is an extra ';' on the "if" statements: > > + if (BufferIsValid(bukcetbuf)); <-- > > in hash_xlog_move_page_contents, hash_xlog_squeeze_page and > hash_xlog_delete. > Okay, Thanks for pointing out the same. I have fixed it. Apart from that, I have changed _hash_alloc_buckets() to initialize the page instead of making it completely Zero because of problems discussed in another related thread [1]. I have also updated README. [1] - https://www.postgresql.org/message-id/CAA4eK1%2BzT16bxVpF3J9UsiZxExCqFa6QxzUfazmJthW6dO78ww%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Thu, Sep 15, 2016 at 11:42 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Okay, Thanks for pointing out the same. I have fixed it. Apart from
that, I have changed _hash_alloc_buckets() to initialize the page
instead of making it completely Zero because of problems discussed in
another related thread [1]. I have also updated README.
with v7 of the concurrent hash patch and v4 of the write ahead log patch and the latest relcache patch (I don't know how important that is to reproducing this, I suspect it is not), I once got this error:
38422 00000 2016-09-19 16:25:50.055 PDT:LOG: database system was interrupted; last known up at 2016-09-19 16:25:49 PDT
38422 00000 2016-09-19 16:25:50.057 PDT:LOG: database system was not properly shut down; automatic recovery in progress
38422 00000 2016-09-19 16:25:50.057 PDT:LOG: redo starts at 3F/2200DE90
38422 01000 2016-09-19 16:25:50.061 PDT:WARNING: page verification failed, calculated checksum 65067 but expected 21260
38422 01000 2016-09-19 16:25:50.061 PDT:CONTEXT: xlog redo at 3F/22053B50 for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
38422 XX001 2016-09-19 16:25:50.071 PDT:FATAL: invalid page in block 9 of relation base/16384/17334
38422 XX001 2016-09-19 16:25:50.071 PDT:CONTEXT: xlog redo at 3F/22053B50 for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
The original page with the invalid checksum is:
$ od 16384_17334_9
0000000 000032 000000 015420 077347 020404 000000 000030 017760
0000020 017760 020004 000000 000000 000000 000000 000000 000000
0000040 000000 000000 000000 000000 000000 000000 000000 000000
*
0017760 000007 000002 177777 177777 000007 000000 000002 177600
0020000
If I ignore the checksum failure and re-start the system, the page gets restored to be a bitmap page.
Cheers,
Jeff
On 09/16/2016 02:42 AM, Amit Kapila wrote: > Okay, Thanks for pointing out the same. I have fixed it. Apart from > that, I have changed _hash_alloc_buckets() to initialize the page > instead of making it completely Zero because of problems discussed in > another related thread [1]. I have also updated README. > Thanks. This needs a rebase against the CHI v8 [1] patch. [1] https://www.postgresql.org/message-id/CAA4eK1+X=8sUd1UCZDZnE3D9CGi9kw+kjxp2Tnw7SX5w8pLBNw@mail.gmail.com Best regards, Jesper
On Tue, Sep 20, 2016 at 10:24 PM, Jeff Janes <jeff.janes@gmail.com> wrote: > On Thu, Sep 15, 2016 at 11:42 PM, Amit Kapila <amit.kapila16@gmail.com> > wrote: >> >> >> Okay, Thanks for pointing out the same. I have fixed it. Apart from >> that, I have changed _hash_alloc_buckets() to initialize the page >> instead of making it completely Zero because of problems discussed in >> another related thread [1]. I have also updated README. >> > > with v7 of the concurrent has patch and v4 of the write ahead log patch and > the latest relcache patch (I don't know how important that is to reproducing > this, I suspect it is not), I once got this error: > > > 38422 00000 2016-09-19 16:25:50.055 PDT:LOG: database system was > interrupted; last known up at 2016-09-19 16:25:49 PDT > 38422 00000 2016-09-19 16:25:50.057 PDT:LOG: database system was not > properly shut down; automatic recovery in progress > 38422 00000 2016-09-19 16:25:50.057 PDT:LOG: redo starts at 3F/2200DE90 > 38422 01000 2016-09-19 16:25:50.061 PDT:WARNING: page verification failed, > calculated checksum 65067 but expected 21260 > 38422 01000 2016-09-19 16:25:50.061 PDT:CONTEXT: xlog redo at 3F/22053B50 > for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T > 38422 XX001 2016-09-19 16:25:50.071 PDT:FATAL: invalid page in block 9 of > relation base/16384/17334 > 38422 XX001 2016-09-19 16:25:50.071 PDT:CONTEXT: xlog redo at 3F/22053B50 > for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T > > > The original page with the invalid checksum is: > I think this is a example of torn page problem, which seems to be happening because of the below code in your test. ! if (JJ_torn_page > 0 && counter++ > JJ_torn_page && !RecoveryInProgress()) { ! nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ/3); ! ereport(FATAL, ! (errcode(ERRCODE_DISK_FULL), ! errmsg("could not write block %u of relation %s: wrote only %d of %d bytes", ! blocknum, ! relpath(reln->smgr_rnode, forknum), ! nbytes, BLCKSZ), ! errhint("JJ is screwing with the database."))); ! } else { ! nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ); ! } If you are running the above test by disabling JJ_torn_page, then it is a different matter and we need to investigate it, but l assume you are running by enabling it. I think this could happen if the actual change in page is in 2/3 part of page which you are not writing in above code. The checksum in page header which is written as part of partial page write (1/3 part of page) would have considered the actual change you have made whereas after restart when it again read the page to apply redo, the checksum calculation won't include the change being made in 2/3 part. Today, Ashutosh has shared the logs of his test run where he has shown similar problem for HEAP page. I think this could happen though rarely for any page with the above kind of tests. Does this explanation explains the reason of problem you are seeing? > > If I ignore the checksum failure and re-start the system, the page gets > restored to be a bitmap page. > Okay, but have you ensured in some way that redo is applied to bitmap page? Today, while thinking on this problem, I realized that currently in patch we are using REGBUF_NO_IMAGE for bitmap page for one of the problem reported by you [1]. That change will fix the problem reported by you, but it will expose bitmap pages for torn-page hazards. I think the right fix there is to make pd_lower equal to pd_upper for bitmap page, so that full page writes doesn't exclude the data in bitmappage. Thoughts? 
[1] - https://www.postgresql.org/message-id/CAA4eK1KJOfVvFUmi6dcX9Y2-0PFHkomDzGuyoC%3DaD3Qj9WPpFA%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
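A rough sketch of the pd_lower/pd_upper idea above (the buffer name and block number are illustrative, and this is only one way the idea could look): if the bitmap page advertises no hole between pd_lower and pd_upper, registering it normally lets the full page image cover the whole bitmap instead of suppressing the image with REGBUF_NO_IMAGE.

    /* On the bitmap page: leave no hole, so hole compression in the
     * full page image cannot discard the bitmap data. */
    PageHeader  bmhdr = (PageHeader) BufferGetPage(bitmapbuf);

    bmhdr->pd_lower = bmhdr->pd_upper;

    /* When logging the operation: take a normal full page image of the
     * bitmap page rather than registering it with REGBUF_NO_IMAGE. */
    XLogRegisterBuffer(1, bitmapbuf, REGBUF_STANDARD);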
On Tue, Sep 20, 2016 at 10:27 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Sep 20, 2016 at 10:24 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Thu, Sep 15, 2016 at 11:42 PM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
>>
>>
>> Okay, Thanks for pointing out the same. I have fixed it. Apart from
>> that, I have changed _hash_alloc_buckets() to initialize the page
>> instead of making it completely Zero because of problems discussed in
>> another related thread [1]. I have also updated README.
>>
>
> with v7 of the concurrent has patch and v4 of the write ahead log patch and
> the latest relcache patch (I don't know how important that is to reproducing
> this, I suspect it is not), I once got this error:
>
>
> 38422 00000 2016-09-19 16:25:50.055 PDT:LOG: database system was
> interrupted; last known up at 2016-09-19 16:25:49 PDT
> 38422 00000 2016-09-19 16:25:50.057 PDT:LOG: database system was not
> properly shut down; automatic recovery in progress
> 38422 00000 2016-09-19 16:25:50.057 PDT:LOG: redo starts at 3F/2200DE90
> 38422 01000 2016-09-19 16:25:50.061 PDT:WARNING: page verification failed,
> calculated checksum 65067 but expected 21260
> 38422 01000 2016-09-19 16:25:50.061 PDT:CONTEXT: xlog redo at 3F/22053B50
> for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
> 38422 XX001 2016-09-19 16:25:50.071 PDT:FATAL: invalid page in block 9 of
> relation base/16384/17334
> 38422 XX001 2016-09-19 16:25:50.071 PDT:CONTEXT: xlog redo at 3F/22053B50
> for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
>
>
> The original page with the invalid checksum is:
>
I think this is an example of the torn page problem, which seems to be
happening because of the below code in your test.
! if (JJ_torn_page > 0 && counter++ > JJ_torn_page &&
!RecoveryInProgress()) {
! nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ/3);
! ereport(FATAL,
! (errcode(ERRCODE_DISK_FULL),
! errmsg("could not write block %u of relation %s: wrote only %d of %d bytes",
! blocknum,
! relpath(reln->smgr_rnode, forknum),
! nbytes, BLCKSZ),
! errhint("JJ is screwing with the database.")));
! } else {
! nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
! }
If you are running the above test by disabling JJ_torn_page, then it
is a different matter and we need to investigate it, but I assume you
are running by enabling it.
I think this could happen if the actual change to the page is in the last 2/3 of the page, which the above code does not write. The checksum in the page header, which goes out as part of the partial page write (the first 1/3 of the page), would have covered the change you made, whereas after restart, when the page is read again to apply redo, the checksum calculation won't include the change sitting in the unwritten 2/3 of the page.
Correct. But any torn page write must be covered by the restoration of a full page image during replay, shouldn't it? And that restoration should happen blindly, without first reading in the old page and verifying the checksum. Failure to restore the page from a FPI would be a bug. (That was the purpose for which I wrote this testing harness in the first place, to verify that the restoration of FPI happens correctly; although most of the bugs it happens to uncover have been unrelated to that.)
Today, Ashutosh shared the logs of his test run where he has shown a similar problem for a HEAP page. I think this could happen, though rarely, for any page with the above kind of tests.
I think Ashutosh's examples are of warnings, not errors. I think the warnings occur when replay needs to read in the block (for reasons I don't understand yet) but then doesn't care if it passes the checksum or not because it will just be blown away by the replay anyway.
Does this explanation explain the reason for the problem you are seeing?
If it can't survive artificial torn page writes, then it probably can't survive real ones either. So I am pretty sure it is a bug of some sort. Perhaps the bug is that it is generating an ERROR when it should just be a WARNING?
>
> If I ignore the checksum failure and re-start the system, the page gets
> restored to be a bitmap page.
>
Okay, but have you ensured in some way that redo is applied to bitmap page?
I haven't done that yet. I can't start the system without destroying the evidence, and I haven't figured out yet how to import a specific block from a shut-down server into a bytea of a running server, in order to inspect it using pageinspect.
Today, while thinking on this problem, I realized that currently in
patch we are using REGBUF_NO_IMAGE for bitmap page for one of the
problem reported by you [1]. That change will fix the problem
reported by you, but it will expose bitmap pages for torn-page
hazards. I think the right fix there is to make pd_lower equal to
pd_upper for bitmap page, so that full page writes doesn't exclude the
data in bitmappage.
I'm afraid that is over my head. I can study it until it makes sense, but it will take me a while.
Cheers,
Jeff
[1] - https://www.postgresql.org/message-id/CAA4eK1KJOfVvFUmi6dcX9Y2-0PFHkomDzGuyoC%3DaD3Qj9WPpFA%40mail.gmail.com
On Thu, Sep 22, 2016 at 8:51 AM, Jeff Janes <jeff.janes@gmail.com> wrote: > On Tue, Sep 20, 2016 at 10:27 PM, Amit Kapila <amit.kapila16@gmail.com> > wrote: >> >> On Tue, Sep 20, 2016 at 10:24 PM, Jeff Janes <jeff.janes@gmail.com> wrote: >> > On Thu, Sep 15, 2016 at 11:42 PM, Amit Kapila <amit.kapila16@gmail.com> >> > wrote: >> >> >> >> >> >> Okay, Thanks for pointing out the same. I have fixed it. Apart from >> >> that, I have changed _hash_alloc_buckets() to initialize the page >> >> instead of making it completely Zero because of problems discussed in >> >> another related thread [1]. I have also updated README. >> >> >> > >> > with v7 of the concurrent has patch and v4 of the write ahead log patch >> > and >> > the latest relcache patch (I don't know how important that is to >> > reproducing >> > this, I suspect it is not), I once got this error: >> > >> > >> > 38422 00000 2016-09-19 16:25:50.055 PDT:LOG: database system was >> > interrupted; last known up at 2016-09-19 16:25:49 PDT >> > 38422 00000 2016-09-19 16:25:50.057 PDT:LOG: database system was not >> > properly shut down; automatic recovery in progress >> > 38422 00000 2016-09-19 16:25:50.057 PDT:LOG: redo starts at >> > 3F/2200DE90 >> > 38422 01000 2016-09-19 16:25:50.061 PDT:WARNING: page verification >> > failed, >> > calculated checksum 65067 but expected 21260 >> > 38422 01000 2016-09-19 16:25:50.061 PDT:CONTEXT: xlog redo at >> > 3F/22053B50 >> > for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T >> > 38422 XX001 2016-09-19 16:25:50.071 PDT:FATAL: invalid page in block 9 >> > of >> > relation base/16384/17334 >> > 38422 XX001 2016-09-19 16:25:50.071 PDT:CONTEXT: xlog redo at >> > 3F/22053B50 >> > for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T >> > >> > >> > The original page with the invalid checksum is: >> > >> >> I think this is a example of torn page problem, which seems to be >> happening because of the below code in your test. >> >> ! if (JJ_torn_page > 0 && counter++ > JJ_torn_page && >> !RecoveryInProgress()) { >> ! nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ/3); >> ! ereport(FATAL, >> ! (errcode(ERRCODE_DISK_FULL), >> ! errmsg("could not write block %u of relation %s: wrote only %d of %d >> bytes", >> ! blocknum, >> ! relpath(reln->smgr_rnode, forknum), >> ! nbytes, BLCKSZ), >> ! errhint("JJ is screwing with the database."))); >> ! } else { >> ! nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ); >> ! } >> >> If you are running the above test by disabling JJ_torn_page, then it >> is a different matter and we need to investigate it, but l assume you >> are running by enabling it. >> >> I think this could happen if the actual change in page is in 2/3 part >> of page which you are not writing in above code. The checksum in page >> header which is written as part of partial page write (1/3 part of >> page) would have considered the actual change you have made whereas >> after restart when it again read the page to apply redo, the checksum >> calculation won't include the change being made in 2/3 part. > > > Correct. But any torn page write must be covered by the restoration of a > full page image during replay, shouldn't it? And that restoration should > happen blindly, without first reading in the old page and verifying the > checksum. > Probably, but I think this is not currently the way it is handled and I don't want to change it. 
AFAIU, what happens now is that we first read the old page (and we do page verification while reading the old page and display a *warning* if the checksum doesn't match) and then restore the image. The relevant code path is XLogReadBufferForRedo()->XLogReadBufferExtended()->ReadBufferWithoutRelcache()->ReadBuffer_common()->PageIsVerified(). Now, the point here is why we are seeing FATAL instead of WARNING for the bitmap page of the hash index. The reason, as explained in my previous e-mail, is that for the bitmap page we are not logging a full page image, and we should fix that as explained there. Once the full page image is logged, then the first time it reads the torn page it will use the flag RBM_ZERO_AND_LOCK, which will turn the FATAL error you are seeing into a WARNING. I think this is the reason why Ashutosh is seeing WARNING for a heap page whereas you are seeing FATAL for the bitmap page of a hash index. I don't think we should try to avoid the WARNING for torn pages as part of this patch, even if that is a better course of action, but certainly the FATAL should be fixed and a WARNING should be displayed instead, as it is for heap pages. > >> Today, while thinking on this problem, I realized that currently in >> patch we are using REGBUF_NO_IMAGE for bitmap page for one of the >> problem reported by you [1]. That change will fix the problem >> reported by you, but it will expose bitmap pages for torn-page >> hazards. I think the right fix there is to make pd_lower equal to >> pd_upper for bitmap page, so that full page writes doesn't exclude the >> data in bitmappage. > > > I'm afraid that is over my head. I can study it until it makes sense, but > it will take me a while. > Does the explanation above help? If not, tell me and I will try to explain it once more. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Sep 22, 2016 at 10:16 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, Sep 22, 2016 at 8:51 AM, Jeff Janes <jeff.janes@gmail.com> wrote: >> >> >> Correct. But any torn page write must be covered by the restoration of a >> full page image during replay, shouldn't it? And that restoration should >> happen blindly, without first reading in the old page and verifying the >> checksum. >> > > Probably, but I think this is not currently the way it is handled and > I don't want to change it. AFAIU, what happens now is that we first > read the old page (and we do page verification while reading old page > and display *warning* if checksum doesn't match) and then restore the > image. The relevant code path is > XLogReadBufferForRedo()->XLogReadBufferExtended()->ReadBufferWithoutRelcache()->ReadBuffer_common()->PageIsVerified(). > > Now, here the point is that why instead of WARNING we are seeing FATAL > for the bitmap page of hash index. The reason as explained in my > previous e-mail is that for bitmap page we are not logging full page > image and we should fix that as explained there. Once the full page > image is logged, then first time it reads the torn page, it will use > flag RBM_ZERO_AND_LOCK which will make the FATAL error you are seeing > to WARNING. > I think here I am slightly wrong. For the full page writes, it do use RBM_ZERO_AND_LOCK mode to read the page and for such mode we are not doing page verification check and rather blindly setting the page to zero and then overwrites it with full page image. So after my fix, you will not see the error of checksum failure. I have a fix ready, but still doing some more verification. If everything passes, I will share the patch in a day or so. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
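For reference, the redo-side pattern being referred to looks roughly like this (the block number and buffer name are illustrative): when the record carries a full page image, XLogReadBufferForRedo() restores the page without verifying the torn on-disk contents, and the incremental change is applied only when no image was taken.

    XLogRedoAction action;
    Buffer      bitmapbuf;

    action = XLogReadBufferForRedo(record, 1, &bitmapbuf);
    if (action == BLK_RESTORED)
    {
        /* Page was overwritten from the full page image; the torn on-disk
         * page was read with RBM_ZERO_AND_LOCK and never checksum-verified. */
    }
    else if (action == BLK_NEEDS_REDO)
    {
        /* No full page image was taken; apply the incremental change here. */
    }
    if (BufferIsValid(bitmapbuf))
        UnlockReleaseBuffer(bitmapbuf);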
On Fri, Sep 23, 2016 at 5:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > > I think here I am slightly wrong. For the full page writes, it do use > RBM_ZERO_AND_LOCK mode to read the page and for such mode we are not > doing page verification check and rather blindly setting the page to > zero and then overwrites it with full page image. So after my fix, > you will not see the error of checksum failure. I have a fix ready, > but still doing some more verification. If everything passes, I will > share the patch in a day or so. > Attached patch fixes the problem, now we do perform full page writes for bitmap pages. Apart from that, I have rebased the patch based on latest concurrent index patch [1]. I have updated the README as well to reflect the WAL logging related information for different operations. With attached patch, all the review comments or issues found till now are addressed. [1] - https://www.postgresql.org/message-id/CAA4eK1%2BX%3D8sUd1UCZDZnE3D9CGi9kw%2Bkjxp2Tnw7SX5w8pLBNw%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Sun, Sep 25, 2016 at 10:30 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, Sep 23, 2016 at 5:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> I think here I am slightly wrong. For the full page writes, it do use >> RBM_ZERO_AND_LOCK mode to read the page and for such mode we are not >> doing page verification check and rather blindly setting the page to >> zero and then overwrites it with full page image. So after my fix, >> you will not see the error of checksum failure. I have a fix ready, >> but still doing some more verification. If everything passes, I will >> share the patch in a day or so. >> > > Attached patch fixes the problem, now we do perform full page writes > for bitmap pages. Apart from that, I have rebased the patch based on > latest concurrent index patch [1]. I have updated the README as well > to reflect the WAL logging related information for different > operations. > > With attached patch, all the review comments or issues found till now > are addressed. > I forgot to mention that Ashutosh has tested this patch for a day using Jeff's tool and he didn't find any problem. Also, he has found a way to easily reproduce the problem. Ashutosh, can you share the changes to the script that you used to reproduce the problem? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Hi All, > I forgot to mention that Ashutosh has tested this patch for a day > using Jeff's tool and he didn't found any problem. Also, he has found > a way to easily reproduce the problem. Ashutosh, can you share your > changes to the script using which you have reproduce the problem? I made slight changes in Jeff Janes patch (crash_REL10.patch) to reproduce the issue reported by him earlier [1]. I changed the crash_REL10.patch such that a torn page write happens only for a bitmap page in hash index. As described by Amit in [2] & [3] we were not logging full page image for bitmap page and that's the reason we were not be able to restore bitmap page in case it is a torn page which would eventually result in below error. 38422 01000 2016-09-19 16:25:50.061 PDT:WARNING: page verification failed, calculated checksum 65067 but expected 21260 38422 01000 2016-09-19 16:25:50.061 PDT:CONTEXT: xlog redo at 3F/22053B50 for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T 38422 XX001 2016-09-19 16:25:50.071 PDT:FATAL: invalid page in block 9 of relation base/16384/17334 Jeff Jane's original patch had following conditions for a torn page write: if (JJ_torn_page > 0 && counter++ > JJ_torn_page && !RecoveryInProgress()) nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ/3); Ashutosh has added some more condition's in above if( ) to ensure that the issue associated with bitmap page gets reproduced easily: if (JJ_torn_page > 0 && counter++ > JJ_torn_page && !RecoveryInProgress() && bucket_opaque->hasho_page_id == HASHO_PAGE_ID && bucket_opaque->hasho_flag & LH_BITMAP_PAGE) nbytes = FileWrite(v->mdfd_vfd, buffer, 24); Also, please note that i am allowing only 24 bytes of data i.e. just a page header to be written into a disk file so that even if a single overflow page is added into hash index table the issue will be observed. Note: You can find Jeff Jane's patch for torn page write at - [4]. [1] - https://www.postgresql.org/message-id/CAMkU%3D1zGcfNTWWxnud8j_p0vb1ENGcngkwqZgPUEnwZmAN%2BXQw%40mail.gmail.com [2] - https://www.postgresql.org/message-id/CAA4eK1LmQZGnYhSHXDDCOsSb_0U-gsxReEmSDRgCZr%3DAdKbTEg%40mail.gmail.com [3] - https://www.postgresql.org/message-id/CAA4eK1%2Bk9tGPw7vOACRo%2B4h5n%3DutHnSuGgMoJrLANybAjNBW9w%40mail.gmail.com [4] - https://www.postgresql.org/message-id/CAMkU%3D1xRt8jBBB7g_7K41W00%3Dbr9UrxMVn_rhWhKPLaHfEdM5A%40mail.gmail.com With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com
On 09/25/2016 01:00 AM, Amit Kapila wrote: > Attached patch fixes the problem, now we do perform full page writes > for bitmap pages. Apart from that, I have rebased the patch based on > latest concurrent index patch [1]. I have updated the README as well > to reflect the WAL logging related information for different > operations. > > With attached patch, all the review comments or issues found till now > are addressed. > I have been running various tests, and applications with this patch together with the CHI v8 patch [1]. As I haven't seen any failures and don't currently have additional feedback, I'm moving this patch to "Ready for Committer" for their feedback. If others have comments, move the patch status back in the CommitFest application, please. [1] https://www.postgresql.org/message-id/CAA4eK1%2BX%3D8sUd1UCZDZnE3D9CGi9kw%2Bkjxp2Tnw7SX5w8pLBNw%40mail.gmail.com Best regards, Jesper
On Sat, Sep 24, 2016 at 10:00 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Sep 23, 2016 at 5:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> I think here I am slightly wrong. For the full page writes, it do use
>> RBM_ZERO_AND_LOCK mode to read the page and for such mode we are not
>> doing page verification check and rather blindly setting the page to
>> zero and then overwrites it with full page image. So after my fix,
>> you will not see the error of checksum failure. I have a fix ready,
>> but still doing some more verification. If everything passes, I will
>> share the patch in a day or so.
>>
>
> Attached patch fixes the problem, now we do perform full page writes
> for bitmap pages. Apart from that, I have rebased the patch based on
> latest concurrent index patch [1]. I have updated the README as well
> to reflect the WAL logging related information for different
> operations.
>
> With attached patch, all the review comments or issues found till now
> are addressed.
This needs to be updated to apply over concurrent_hash_index_v10.patch.
Unless we want to wait until that work is committed before doing more review and testing on this.
Thanks,
Jeff
On Tue, Nov 8, 2016 at 10:56 PM, Jeff Janes <jeff.janes@gmail.com> wrote: > On Sat, Sep 24, 2016 at 10:00 PM, Amit Kapila <amit.kapila16@gmail.com> > wrote: >> >> On Fri, Sep 23, 2016 at 5:34 PM, Amit Kapila <amit.kapila16@gmail.com> >> wrote: >> > >> > I think here I am slightly wrong. For the full page writes, it do use >> > RBM_ZERO_AND_LOCK mode to read the page and for such mode we are not >> > doing page verification check and rather blindly setting the page to >> > zero and then overwrites it with full page image. So after my fix, >> > you will not see the error of checksum failure. I have a fix ready, >> > but still doing some more verification. If everything passes, I will >> > share the patch in a day or so. >> > >> >> Attached patch fixes the problem, now we do perform full page writes >> for bitmap pages. Apart from that, I have rebased the patch based on >> latest concurrent index patch [1]. I have updated the README as well >> to reflect the WAL logging related information for different >> operations. >> >> With attached patch, all the review comments or issues found till now >> are addressed. > > This needs to be updated to apply over concurrent_hash_index_v10.patch. > > Unless we want to wait until that work is committed before doing more review > and testing on this. >

The concurrent hash index patch is still changing, and some of those changes need corresponding changes in this patch as well. So I think I will update this patch after the other one gets somewhat stabilized. I am not sure it is a good idea to update it for every version of the hash index patch. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Nov 9, 2016 at 7:40 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Tue, Nov 8, 2016 at 10:56 PM, Jeff Janes <jeff.janes@gmail.com> wrote: >> >> Unless we want to wait until that work is committed before doing more review >> and testing on this. >> > > The concurrent hash index patch is getting changed and some of the > changes needs change in this patch as well. So, I think after it gets > somewhat stabilized, I will update this patch as well. > Now that concurrent hash index patch is committed [1], I will work on rebasing this patch. Note, I have moved this to next CF. [1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=6d46f4783efe457f74816a75173eb23ed8930020 -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Dec 1, 2016 at 1:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Wed, Nov 9, 2016 at 7:40 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> On Tue, Nov 8, 2016 at 10:56 PM, Jeff Janes <jeff.janes@gmail.com> wrote: >>> Unless we want to wait until that work is committed before doing more review >>> and testing on this. >> >> The concurrent hash index patch is getting changed and some of the >> changes needs change in this patch as well. So, I think after it gets >> somewhat stabilized, I will update this patch as well. > > Now that concurrent hash index patch is committed [1], I will work on > rebasing this patch. Note, I have moved this to next CF. Thanks. I am thinking that it might make sense to try to get the "microvacuum support for hash index" and "cache hash index meta page" patches committed before this one, because I'm guessing they are much simpler than this one, and I think therefore that the review of those patches can probably move fairly quickly. Of course, ideally I can also start reviewing this one in the meantime. Does that make sense to you? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Dec 1, 2016 at 9:44 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Dec 1, 2016 at 1:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> On Wed, Nov 9, 2016 at 7:40 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>> On Tue, Nov 8, 2016 at 10:56 PM, Jeff Janes <jeff.janes@gmail.com> wrote: >>>> Unless we want to wait until that work is committed before doing more review >>>> and testing on this. >>> >>> The concurrent hash index patch is getting changed and some of the >>> changes needs change in this patch as well. So, I think after it gets >>> somewhat stabilized, I will update this patch as well. >> >> Now that concurrent hash index patch is committed [1], I will work on >> rebasing this patch. Note, I have moved this to next CF. > > Thanks. I am thinking that it might make sense to try to get the > "microvacuum support for hash index" and "cache hash index meta page" > patches committed before this one, because I'm guessing they are much > simpler than this one, and I think therefore that the review of those > patches can probably move fairly quickly. > I think it makes sense to move "cache hash index meta page" first, however "microvacuum support for hash index" is based on WAL patch as the action in this patch (delete op) also needs to be logged. One idea could be that we can try to split the patch so that WAL logging can be done as a separate patch, but I am not sure if it is worth. > Of course, ideally I can > also start reviewing this one in the meantime. Does that make sense > to you? > You can start reviewing some of the operations like "Create Index", "Insert". However, some changes are required because of change in locking strategy for Vacuum. I am planning to work on rebasing it and making required changes in next week. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Dec 1, 2016 at 6:51 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> Thanks. I am thinking that it might make sense to try to get the >> "microvacuum support for hash index" and "cache hash index meta page" >> patches committed before this one, because I'm guessing they are much >> simpler than this one, and I think therefore that the review of those >> patches can probably move fairly quickly. > > I think it makes sense to move "cache hash index meta page" first, > however "microvacuum support for hash index" is based on WAL patch as > the action in this patch (delete op) also needs to be logged. One > idea could be that we can try to split the patch so that WAL logging > can be done as a separate patch, but I am not sure if it is worth.

The thing is, there's a fair amount of locking stupidity in what just got committed because of the requirement that the TID doesn't decrease within a page. I'd really like to get that fixed.

>> Of course, ideally I can >> also start reviewing this one in the meantime. Does that make sense >> to you? >> > > You can start reviewing some of the operations like "Create Index", > "Insert". However, some changes are required because of change in > locking strategy for Vacuum. I am planning to work on rebasing it and > making required changes in next week.

I'll review after that, since I have other things to review meanwhile. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Dec 2, 2016 at 7:07 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Dec 1, 2016 at 6:51 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>> Of course, ideally I can >>> also start reviewing this one in the meantime. Does that make sense >>> to you? >>> >> >> You can start reviewing some of the operations like "Create Index", >> "Insert". However, some changes are required because of change in >> locking strategy for Vacuum. I am planning to work on rebasing it and >> making required changes in next week. > > I'll review after that, since I have other things to review meanwhile. >

Please find the rebased patch attached to this e-mail. There is no fundamental change in the patch except for adapting the new locking strategy in the squeeze operation. I have done sanity testing of the patch with a master-standby setup, and Kuntal has helped me verify it with his wal-consistency checker patch. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Mon, Dec 5, 2016 at 2:46 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> I'll review after that, since I have other things to review meanwhile. > > Attached, please find the rebased patch attached with this e-mail. > There is no fundamental change in patch except for adapting the new > locking strategy in squeeze operation. I have done the sanity testing > of the patch with master-standby setup and Kuntal has helped me to > verify it with his wal-consistency checker patch.

This patch again needs a rebase, but before you do that I'd like to make it harder by applying the attached patch to remove _hash_chgbufaccess(), which I think is a bad plan for more or less the same reasons that motivated the removal of _hash_wrtbuf() in commit 25216c98938495fd741bf585dcbef45b3a9ffd40. I think there's probably more simplification and cleanup that can be done afterward in the wake of this; what I've done here is just a mechanical replacement of _hash_chgbufaccess() with LockBuffer() and/or MarkBufferDirty(). The point is that having MarkBufferDirty() calls happen implicitly inside some other function is not what we want for write-ahead logging. Your patch gets rid of that, too; this is just doing it somewhat more thoroughly. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
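To make the mechanical replacement concrete, the rewrite at a typical call site looks roughly like the sketch below. This is an illustration based on my understanding of the old _hash_chgbufaccess() semantics (dirty the buffer if it was written, then change the lock state while keeping the pin); it is not an excerpt from Robert's patch, and the call sites shown are generic.

    /* Old style: dirty the page and drop the lock (pin is kept) in one call. */
    _hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);

    /* New style: the same steps spelled out, so the MarkBufferDirty() is
     * visible at the call site and can later move inside the WAL critical
     * section next to the XLogInsert() for the change. */
    MarkBufferDirty(buf);
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);

    /* Old style: re-lock a pinned buffer for reading; no dirtying needed. */
    _hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);

    /* New style */
    LockBuffer(buf, BUFFER_LOCK_SHARE);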
On Thu, Dec 22, 2016 at 9:56 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Dec 5, 2016 at 2:46 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>> I'll review after that, since I have other things to review meanwhile. >> >> Attached, please find the rebased patch attached with this e-mail. >> There is no fundamental change in patch except for adapting the new >> locking strategy in squeeze operation. I have done the sanity testing >> of the patch with master-standby setup and Kuntal has helped me to >> verify it with his wal-consistency checker patch. > > This patch again needs a rebase, >

I had completed it last night and left it running night-long tests to check the sanity of the patch, but I guess now I need another rebase. Anyway, I feel this is all for the betterment of the final patch.

> but before you do that I'd like to > make it harder by applying the attached patch to remove > _hash_chgbufaccess(), which I think is a bad plan for more or less the > same reasons that motivated the removal of _hash_wrtbuf() in commit > 25216c98938495fd741bf585dcbef45b3a9ffd40. I think there's probably > more simplification and cleanup that can be done afterward in the wake > of this; what I've done here is just a mechanical replacement of > _hash_chgbufaccess() with LockBuffer() and/or MarkBufferDirty().

The patch has replaced the usage of HASH_READ/HASH_WRITE with BUFFER_LOCK_SHARE/BUFFER_LOCK_EXCLUSIVE, which will leave the hash code using two sets of defines for the same purpose. We could say that it is better to replace HASH_READ/HASH_WRITE throughout the hash index code, or maybe the usage this patch introduces is okay; note, though, that btree uses similar terminology (BT_READ/BT_WRITE). Other than that, your patch looks good. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
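For readers without the headers handy, the two sets of names being discussed are essentially synonyms for the generic buffer-lock modes; roughly, from src/include/access/hash.h and src/include/access/nbtree.h:

    /* hash.h */
    #define HASH_READ       BUFFER_LOCK_SHARE
    #define HASH_WRITE      BUFFER_LOCK_EXCLUSIVE
    #define HASH_NOLOCK     (-1)

    /* nbtree.h -- the precedent Amit mentions */
    #define BT_READ         BUFFER_LOCK_SHARE
    #define BT_WRITE        BUFFER_LOCK_EXCLUSIVE

So the question is purely cosmetic: whether call sites spell the lock mode as HASH_READ/HASH_WRITE or as the underlying BUFFER_LOCK_* values.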
On Fri, Dec 23, 2016 at 6:40 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, Dec 22, 2016 at 9:56 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Mon, Dec 5, 2016 at 2:46 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>>> I'll review after that, since I have other things to review meanwhile. >>> >>> Attached, please find the rebased patch attached with this e-mail. >>> There is no fundamental change in patch except for adapting the new >>> locking strategy in squeeze operation. I have done the sanity testing >>> of the patch with master-standby setup and Kuntal has helped me to >>> verify it with his wal-consistency checker patch. >> >> This patch again needs a rebase, >> > > I had completed it yesterday night and kept it for night long tests to > check the sanity of the patch, but I guess now I need another rebase. >

After recent commits 7819ba1e and 25216c98, this patch requires a rebase. Attached is the rebased patch. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
On 12/27/2016 01:58 AM, Amit Kapila wrote: > After recent commit's 7819ba1e and 25216c98, this patch requires a > rebase. Attached is the rebased patch. > This needs a rebase after commit e898437. Best regards, Jesper
On Fri, Jan 13, 2017 at 1:04 AM, Jesper Pedersen <jesper.pedersen@redhat.com> wrote: > On 12/27/2016 01:58 AM, Amit Kapila wrote: >> >> After recent commit's 7819ba1e and 25216c98, this patch requires a >> rebase. Attached is the rebased patch. >> > > This needs a rebase after commit e898437. > Attached find the rebased patch. Thanks! -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Fri, Jan 13, 2017 at 12:23 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, Jan 13, 2017 at 1:04 AM, Jesper Pedersen > <jesper.pedersen@redhat.com> wrote: >> On 12/27/2016 01:58 AM, Amit Kapila wrote: >>> >>> After recent commit's 7819ba1e and 25216c98, this patch requires a >>> rebase. Attached is the rebased patch. >>> >> >> This needs a rebase after commit e898437. >> > > Attached find the rebased patch. The patch applies, and there have been no reviews since the last version was submitted. So I have moved it to CF 2017-03. -- Michael
On Thu, Jan 12, 2017 at 7:23 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Jan 13, 2017 at 1:04 AM, Jesper Pedersen
> <jesper.pedersen@redhat.com> wrote:
>> On 12/27/2016 01:58 AM, Amit Kapila wrote:
>>>
>>> After recent commit's 7819ba1e and 25216c98, this patch requires a
>>> rebase. Attached is the rebased patch.
>>>
>>
>> This needs a rebase after commit e898437.
>>
>
> Attached find the rebased patch.
>
> Thanks!
I've put this through a lot of crash-recovery testing using different variations of my previously posted testing harness, and have not had any problems.
I occasionally get deadlocks which are properly detected. I think this is just a user-space problem, not a bug, but will describe it anyway just in case...
What happens is that connections occasionally create 10,000 nuisance tuples all using the same randomly chosen negative integer index value (the one the hash index is on), and then some time later delete those tuples using the memorized negative index value, to force the page split and squeeze to get exercised. If two connections happen to choose the same negative index for their nuisance tuples, and then try to delete "their" tuples at the same time, and they are deleting the same 20,000 "nuisance" tuples concurrently with a bucket split/squeeze/something, they might see the tuples in a different order from each other and so deadlock against each other waiting on each other's transaction locks due to "locked" tuples. Is that a plausible explanation?
Cheers,
Jeff
On Sat, Feb 4, 2017 at 4:37 AM, Jeff Janes <jeff.janes@gmail.com> wrote: > On Thu, Jan 12, 2017 at 7:23 PM, Amit Kapila <amit.kapila16@gmail.com> > wrote: >> >> On Fri, Jan 13, 2017 at 1:04 AM, Jesper Pedersen >> <jesper.pedersen@redhat.com> wrote: >> > On 12/27/2016 01:58 AM, Amit Kapila wrote: >> >> >> >> After recent commit's 7819ba1e and 25216c98, this patch requires a >> >> rebase. Attached is the rebased patch. >> >> >> > >> > This needs a rebase after commit e898437. >> > >> >> Attached find the rebased patch. >> >> Thanks! > > > I've put this through a lot of crash-recovery testing using different > variations of my previously posted testing harness, and have not had any > problems. > Thanks for detailed testing of this patch. > I occasionally get deadlocks which are properly detected. I think this is > just a user-space problem, not a bug, but will describe it anyway just in > case... > > What happens is that connections occasionally create 10,000 nuisance tuples > all using the same randomly chosen negative integer index value (the one the > hash index is on), and then some time later delete those tuples using the > memorized negative index value, to force the page split and squeeze to get > exercised. If two connections happen to choose the same negative index for > their nuisance tuples, and then try to delete "their" tuples at the same > time, and they are deleting the same 20,000 "nuisance" tuples concurrently > with a bucket split/squeeze/something, they might see the tuples in a > different order from each other and so deadlock against each other waiting > on each others transaction locks due to "locked" tuples. Is that a > plausible explanation? > Yes, that is right explanation for the deadlock you are seeing. A few days back Jesper and Ashutosh Sharma have also faced the same problem [1] with the similar test (non-unique values in hash index). To explain in brief, how this can happen is that squeeze operation moves tuples from the far end of the bucket to rear end and it is quite possible that different backends can see tuples with same values in a different order. [1] - https://www.postgresql.org/message-id/CAE9k0PmBGrNO_s%2ByKn72VO--ypaXWemOsizsWr5cHjEv0X4ERg%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Jan 12, 2017 at 10:23 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, Jan 13, 2017 at 1:04 AM, Jesper Pedersen > <jesper.pedersen@redhat.com> wrote: >> On 12/27/2016 01:58 AM, Amit Kapila wrote: >>> After recent commit's 7819ba1e and 25216c98, this patch requires a >>> rebase. Attached is the rebased patch. >> >> This needs a rebase after commit e898437. > > Attached find the rebased patch. Well, I've managed to break this again by committing more things. I think it would be helpful if you broke this into a series of preliminary refactoring patches and then a final patch that actually adds WAL-logging. The things that look like preliminary refactoring to me are: - Adding _hash_pgaddmultitup and using it in various places. - Adding and freeing overflow pages has been extensively reworked. - Similarly, there is some refactoring of how bitmap pages get initialized. - Index initialization has been rejiggered significantly. - Bucket splits have been rejiggered. Individually those changes don't look all that troublesome to review, but altogether it's quite a lot. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Feb 9, 2017 at 10:28 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Jan 12, 2017 at 10:23 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> On Fri, Jan 13, 2017 at 1:04 AM, Jesper Pedersen >> <jesper.pedersen@redhat.com> wrote: >>> On 12/27/2016 01:58 AM, Amit Kapila wrote: >>>> After recent commit's 7819ba1e and 25216c98, this patch requires a >>>> rebase. Attached is the rebased patch. >>> >>> This needs a rebase after commit e898437. >> >> Attached find the rebased patch. > > Well, I've managed to break this again by committing more things. >

I have again rebased it; this time it required a few more changes due to the previous commits. As we now have to store the hashm_maxbucket number in the prevblock number of the primary bucket page, we need to take care of the same during replay wherever applicable. Also, there were a few new tests in contrib modules which were failing. The attached patch is still a rebased version of the previous patch; I will work on splitting it as per your suggestion below, but I thought it was better to post the rebased patch as well.

> I think it would be helpful if you broke this into a series of > preliminary refactoring patches and then a final patch that actually > adds WAL-logging.

Okay, that makes sense to me as well.

> The things that look like preliminary refactoring to > me are: > > - Adding _hash_pgaddmultitup and using it in various places. > - Adding and freeing overflow pages has been extensively reworked. >

Freeing the overflow page is too tightly coupled with the changes related to _hash_pgaddmultitup, so it might be better to keep it along with them. However, I think we can prepare a separate patch for the changes related to adding the overflow page.

> - Similarly, there is some refactoring of how bitmap pages get initialized. > - Index initialization has been rejiggered significantly. > - Bucket splits have been rejiggered. >

I will try to prepare separate patches for the above three. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Fri, Feb 10, 2017 at 12:15 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, Feb 9, 2017 at 10:28 PM, Robert Haas <robertmhaas@gmail.com> wrote: > >> The things that look like preliminary refactoring to >> me are: >> >> - Adding _hash_pgaddmultitup and using it in various places. >> - Adding and freeing overflow pages has been extensively reworked. >> > > Freeing the overflow page is too tightly coupled with changes related > to _hash_pgaddmultitup, so it might be better to keep it along with > it. However, I think we can prepare a separate patch for changes > related to adding the overflow page. > >> - Similarly, there is some refactoring of how bitmap pages get initialized. >> - Index initialization has been rejiggered significantly. >> - Bucket splits have been rejiggered. >> > As discussed, attached are refactoring patches and a patch to enable WAL for the hash index on top of them. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
- 0001-Expose-a-new-API-_hash_pgaddmultitup-to-add-multiple.patch
- 0002-Expose-hashinitbitmapbuffer-to-initialize-bitmap-pag.patch
- 0003-Restructure-_hash_addovflpage-so-that-the-operation-.patch
- 0004-restructure-split-bucket-code-so-that-the-operation-.patch
- 0005-Restructure-hash-index-creation-and-metapage-initial.patch
- 0006-Enable-WAL-for-Hash-Indexes.patch
On Mon, Feb 13, 2017 at 8:52 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > As discussed, attached are refactoring patches and a patch to enable > WAL for the hash index on top of them. 0006-Enable-WAL-for-Hash-Indexes.patch needs to be rebased after commit 8da9a226369e9ceec7cef1. -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com
On Mon, Feb 13, 2017 at 10:22 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > As discussed, attached are refactoring patches and a patch to enable > WAL for the hash index on top of them.

Thanks. I think that the refactoring patches shouldn't add START_CRIT_SECTION() and END_CRIT_SECTION() calls; let's leave that for the final patch. Nor should their comments reference the idea of writing WAL; that should also be for the final patch.

PageGetFreeSpaceForMulTups doesn't seem like a great name. PageGetFreeSpaceForMultipleTuples? Or maybe just use PageGetExactFreeSpace and then do the accounting in the caller. I'm not sure it's really worth having a function just to subtract a multiple of sizeof(ItemIdData), and it would actually be more efficient to have the caller take care of this, since you wouldn't need to keep recalculating the value for every iteration of the loop.

I think we ought to just rip (as a preliminary patch) out the support for freeing an overflow page in the middle of a bucket chain. That code is impossible to hit, I believe, and instead of complicating the WAL machinery to cater to a nonexistent case, maybe we ought to just get rid of it.

+ /*
+  * We need to release and if required reacquire the lock on
+  * rbuf to ensure that standby shouldn't see an intermediate
+  * state of it.
+  */

How does releasing and reacquiring the lock on the master affect whether the standby can see an intermediate state? That sounds totally wrong to me (and also doesn't belong in a refactoring patch anyway since we're not emitting any WAL here).

- Assert(PageIsNew(page));

This worries me a bit. What's going on here? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
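If it helps to visualize the alternative Robert describes, the caller-side accounting with PageGetExactFreeSpace() would look something like the sketch below. The variable names (itups, nitups) are placeholders, not taken from the patch.

    /* Caller-side accounting: compute the free space once, then charge each
     * tuple plus its line pointer against it. */
    Size    freespace = PageGetExactFreeSpace(page);
    int     i;

    for (i = 0; i < nitups; i++)
    {
        Size    needed = MAXALIGN(IndexTupleSize(itups[i])) + sizeof(ItemIdData);

        if (needed > freespace)
            break;              /* this tuple must go to the next write page */
        freespace -= needed;    /* account for the tuple we will add here */
    }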
On Thu, Feb 16, 2017 at 7:15 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Feb 13, 2017 at 10:22 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> As discussed, attached are refactoring patches and a patch to enable >> WAL for the hash index on top of them. > > Thanks. I think that the refactoring patches shouldn't add > START_CRIT_SECTION() and END_CRIT_SECTION() calls; let's leave that > for the final patch. Nor should their comments reference the idea of > writing WAL; that should also be for the final patch. >

Okay, changed as per suggestion.

> PageGetFreeSpaceForMulTups doesn't seem like a great name. > PageGetFreeSpaceForMultipleTuples?

Okay, changed as per suggestion.

> Or maybe just use > PageGetExactFreeSpace and then do the account in the caller. I'm not > sure it's really worth having a function just to subtract a multiple > of sizeof(ItemIdData), and it would actually be more efficient to have > the caller take care of this, since you wouldn't need to keep > recalculating the value for every iteration of the loop. >

I have tried this, but I think even if we track it outside, we still need something like PageGetFreeSpaceForMultipleTuples for the cases when we have to try the next write page(s) where the data (set of index tuples) can be accommodated.

> I think we ought to just rip (as a preliminary patch) out the support > for freeing an overflow page in the middle of a bucket chain. That > code is impossible to hit, I believe, and instead of complicating the > WAL machinery to cater to a nonexistent case, maybe we ought to just > get rid of it. >

I also could not think of a case where we need to care about a page in the middle of the bucket chain as per its current usage. In particular, are you worried about the code related to the nextblock number in _hash_freeovflpage()? Surely we can remove it, but what complexity are you referring to here? There is some additional book-keeping for the primary bucket page, but there is nothing much about maintaining a backward link. One more thing to note is that this is an exposed API, so to avoid getting it used in some wrong way, we might want to make it static.

> + /* > + * We need to release and if required > reacquire the lock on > + * rbuf to ensure that standby > shouldn't see an intermediate > + * state of it. > + */ > > How does releasing and reacquiring the lock on the master affect > whether the standby can see an intermediate state?

I am talking about the intermediate state of the standby with respect to the master. We will release the lock on the standby after replaying the WAL for moving tuples from the read page to the write page, whereas the master will hold it for a much longer period (till we move all the tuples from the read page and free that page). I understand that there is no correctness issue here, but I was trying to make the operations on master and standby similar, and another thing is that it will help us obey the coding rule of releasing the lock on buffers after writing the WAL record. I see no harm in maintaining the existing coding pattern; however, if you think that is not important in this case, then I can change the code to avoid releasing the lock on the read page.

> That sounds > totally wrong to me (and also doesn't belong in a refactoring patch > anyway since we're not emitting any WAL here). >

Agreed that even if we want to make such a change, it should be part of the WAL patch. So for now, I have reverted that change.

> - Assert(PageIsNew(page)); > > This worries me a bit. What's going on here? >

This is related to the change below.
+ /* + * Initialise the freed overflow page, here we can't complete zeroed the + * page as WAL replay routines expect pages to be initialized. See + * explanation of RBM_NORMAL mode atop XLogReadBufferExtended. + */ + _hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf));

Basically, we need this routine to initialize freed overflow pages. Also, if you look at the similar function in btree (_bt_pageinit(Page page, Size size)), it doesn't have any such assertion.

Attached are the refactoring patches. The WAL patch needs some changes based on the above comments, so I will post it later. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
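For context on the Assert(PageIsNew(page)) discussion above, re-initializing a freed overflow page boils down to something like the sketch below (my paraphrase of the patch, not its exact wording): give the page a fresh header and an "unused" special area rather than zeroing it, because redo reads it in RBM_NORMAL mode and expects an initialized page.

    /* Re-initialize the freed overflow page instead of zeroing it (sketch). */
    Page            ovflpage = BufferGetPage(ovflbuf);
    HashPageOpaque  ovflopaque;

    _hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf));

    ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
    ovflopaque->hasho_prevblkno = InvalidBlockNumber;
    ovflopaque->hasho_nextblkno = InvalidBlockNumber;
    ovflopaque->hasho_bucket = -1;                 /* no longer tied to a bucket */
    ovflopaque->hasho_flag = LH_UNUSED_PAGE;
    ovflopaque->hasho_page_id = HASHO_PAGE_ID;

    MarkBufferDirty(ovflbuf);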
On Thu, Feb 16, 2017 at 8:16 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, Feb 16, 2017 at 7:15 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > Attached are refactoring patches. WAL patch needs some changes based > on above comments, so will post it later. >

Attached is a rebased patch to enable WAL logging for hash indexes. Note that this needs to be applied on top of the refactoring patches [1] sent in the previous e-mail. [1] - https://www.postgresql.org/message-id/CAA4eK1JCk-abtsVMMP8xqGZktUH73ZmLaZ6b_%2B-oCRtRkdqPrQ%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Thu, Feb 16, 2017 at 8:16 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > Attached are refactoring patches. WAL patch needs some changes based > on above comments, so will post it later. After some study, I have committed 0001, and also committed 0002 and 0003 as a single commit, with only minor tweaks. I haven't made a full study of 0004 yet, but I'm a bit concerned about a couple of the hunks, specifically this one: @@ -985,9 +962,9 @@ _hash_splitbucket_guts(Relation rel, if (PageGetFreeSpace(npage) < itemsz) { - /* write out nbuf and drop lock, but keep pin */ - MarkBufferDirty(nbuf); + /* drop lock, but keep pin */ LockBuffer(nbuf, BUFFER_LOCK_UNLOCK); + /* chain to a new overflow page */ nbuf = _hash_addovflpage(rel, metabuf, nbuf, (nbuf == bucket_nbuf) ? true : false); npage = BufferGetPage(nbuf); And also this one: @@ -1041,17 +1024,6 @@ _hash_splitbucket_guts(Relation rel, * To avoid deadlocks due to locking order of buckets, firstlock the old * bucket and then the new bucket. */ - if (nbuf == bucket_nbuf) - { - MarkBufferDirty(bucket_nbuf); - LockBuffer(bucket_nbuf, BUFFER_LOCK_UNLOCK); - } - else - { - MarkBufferDirty(nbuf); - _hash_relbuf(rel, nbuf); - } - LockBuffer(bucket_obuf, BUFFER_LOCK_EXCLUSIVE); opage = BufferGetPage(bucket_obuf); oopaque = (HashPageOpaque)PageGetSpecialPointer(opage); I haven't quite grasped what's going on here, but it looks like those MarkBufferDirty() calls didn't move someplace else, but rather just vanished. That would seem to be not good. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Feb 27, 2017 at 11:07 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Feb 16, 2017 at 8:16 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> Attached are refactoring patches. WAL patch needs some changes based >> on above comments, so will post it later. > > After some study, I have committed 0001, and also committed 0002 and > 0003 as a single commit, with only minor tweaks. > > I haven't made a full study of 0004 yet, but I'm a bit concerned about > a couple of the hunks, specifically this one: > > @@ -985,9 +962,9 @@ _hash_splitbucket_guts(Relation rel, > > if (PageGetFreeSpace(npage) < itemsz) > { > - /* write out nbuf and drop lock, but keep pin */ > - MarkBufferDirty(nbuf); > + /* drop lock, but keep pin */ > LockBuffer(nbuf, BUFFER_LOCK_UNLOCK); > + > /* chain to a new overflow page */ > nbuf = _hash_addovflpage(rel, metabuf, nbuf, > (nbuf == bucket_nbuf) ? true : false); > npage = BufferGetPage(nbuf); > > And also this one: > > @@ -1041,17 +1024,6 @@ _hash_splitbucket_guts(Relation rel, > * To avoid deadlocks due to locking order of buckets, first lock the old > * bucket and then the new bucket. > */ > - if (nbuf == bucket_nbuf) > - { > - MarkBufferDirty(bucket_nbuf); > - LockBuffer(bucket_nbuf, BUFFER_LOCK_UNLOCK); > - } > - else > - { > - MarkBufferDirty(nbuf); > - _hash_relbuf(rel, nbuf); > - } > - > LockBuffer(bucket_obuf, BUFFER_LOCK_EXCLUSIVE); > opage = BufferGetPage(bucket_obuf); > oopaque = (HashPageOpaque) PageGetSpecialPointer(opage); > > I haven't quite grasped what's going on here, but it looks like those > MarkBufferDirty() calls didn't move someplace else, but rather just > vanished. That would seem to be not good. > Yeah, actually those were added later in Enable-WAL-for-Hash* patch, but I think as this patch is standalone, so we should not remove it from their existing usage, I have added those back and rebased the remaining patches. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Tue, Feb 28, 2017 at 7:31 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > Yeah, actually those were added later in Enable-WAL-for-Hash* patch, > but I think as this patch is standalone, so we should not remove it > from their existing usage, I have added those back and rebased the > remaining patches. Great, thanks. 0001 looks good to me now, so committed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Mar 1, 2017 at 4:18 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Feb 28, 2017 at 7:31 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> Yeah, actually those were added later in Enable-WAL-for-Hash* patch, >> but I think as this patch is standalone, so we should not remove it >> from their existing usage, I have added those back and rebased the >> remaining patches. > > Great, thanks. 0001 looks good to me now, so committed. Committed 0002. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Mar 7, 2017 at 5:08 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Mar 1, 2017 at 4:18 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Tue, Feb 28, 2017 at 7:31 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>> Yeah, actually those were added later in Enable-WAL-for-Hash* patch, >>> but I think as this patch is standalone, so we should not remove it >>> from their existing usage, I have added those back and rebased the >>> remaining patches. >> >> Great, thanks. 0001 looks good to me now, so committed. > > Committed 0002. Here are some initial review thoughts on 0003 based on a first read-through. + affected by this setting. Example include system catalogs. For Not grammatical. Perhaps "affected by this setting, such as system catalogs". - mark meta page dirty and release buffer content lock and pin + mark meta page dirty and write WAL for update of metapage + release buffer content lock and pin Was the change in the indentation level intentional? +accomodated in the initial set of buckets and it is not unusual to have large Spelling. +As deletion involves multiple atomic operations, it is quite possible that +system crashes after (a) removing tuples from some of the bucket pages +(b) before clearing the garbage flag (c) before updating the metapage. If the Need two commas and an "or": (a) removing tuples from some of the bucket pages, (b) before clearing the garbage flag, or (c) +quite possible, that system crashes before completing the operation on No comma. Also, "that the system crashes". +the index will remain bloated and can impact performance of read and and this can impact + /* + * we need to release and reacquire the lock on overflow buffer to ensure + * that standby shouldn't see an intermediate state of it. + */ + LockBuffer(ovflbuf, BUFFER_LOCK_UNLOCK); + LockBuffer(ovflbuf, BUFFER_LOCK_EXCLUSIVE); I still think this is a bad idea. Releasing and reacquiring the lock on the master doesn't prevent the standby from seeing intermediate states; the comment, if I understand correctly, is just plain wrong. What it does do is make the code on the master more complicated, since now the page returned by _hash_addovflpage might theoretically fill up while you are busy releasing and reacquiring the lock. + * Add the tuples (itups) to wbuf in this function, we could do that in the + * caller as well. The advantage of doing it here is we can easily write Change comma to period. Then: "We could easily do that in the caller, but the advantage..." + * Initialise the freed overflow page, here we can't complete zeroed the Don't use British spelling, and don't use a comma to join two sentences. Also "here we can't complete zeroed" doesn't seem right; I don't know what that means. + * To replay meta page changes, we can log the entire metapage which + * doesn't seem advisable considering size of hash metapage or we can + * log the individual updated values which seems doable, but we prefer + * to perform exact operations on metapage during replay as are done + * during actual operation. That looks straight forward and has an + * advantage of using lesser space in WAL. This sounds like it is describing two alternatives that were not selected and then the one that was selected, but I don't understand the difference between the second non-selected alternative and what was actually picked. I'd just write "We need only log the one field from the metapage that has changed." or maybe drop the comment altogether. 
+ XLogEnsureRecordSpace(0, XLR_NORMAL_RDATAS + nitups); The use of XLR_NORMAL_RDATAS here seems wrong. Presumably you should pass the number of rdatas you will actually need, which is some constant plus nitups. I think you should just write that constant here directly, whether or not it happens to be equal to XLR_NORMAL_RDATAS. + /* + * We need to release and if required reacquire the lock on + * rbuf to ensure that standby shouldn't see an intermediate + * state of it. If we don't release the lock, after replay of + * XLOG_HASH_MOVE_PAGE_CONTENTS on standby users will be able to + * view the results of partial deletion on rblkno. + */ + LockBuffer(rbuf, BUFFER_LOCK_UNLOCK); If you DO release the lock, this will STILL be true, because what matters on the standby is what the redo code does. This code doesn't even execute on the standby, so how could it possibly affect what intermediate states are visible there? + /* + * Here, we could have copied the entire metapage as well to WAL + * record and restore it as it is during replay. However the size of + * metapage is not small (more than 600 bytes), so recording just the + * information required to construct metapage. + */ Again, this seems sorta obvious. You've got a bunch of these comments that say similar things. I'd either make them shorter or just remove them. + * We can log the exact changes made to meta page, however as no + * concurrent operation could see the index during the replay of this + * record, we can perform the operations during replay as they are + * done here. Don't use a comma to join two sentences. Also, I don't understand anything up to the "however". I think you mean something like "We don't need hash index creation to be atomic from the point of view of replay, because until the creating transaction commits it's not visible to anything on the standby anyway." + * Set pd_lower just past the end of the metadata. This is to log + * full_page_image of metapage in xloginsert.c. Why the underscores? + * won't be a leak on standby, as next split will consume this space. The master and the standby had better be the same, so I don't know what it means to talk about a leak on the standby. + * Change the shared buffer state in critical section, + * otherwise any error could make it unrecoverable after + * recovery. Uh, what? We only recover things during recovery. Nothing gets recovered after recovery. Maybe this could get some tests, via the isolation tester, for things like snapshot too old? I haven't gone through all of this in detail yet; I think most of it is right on target. But there is a lot more to look at yet. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
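To make the XLogEnsureRecordSpace() point concrete: the second argument is the number of rdatas the record will actually register, so the call would pass the record's own fixed count plus one per tuple. The sketch below is illustrative only; the "2" stands in for however many XLogRegisterData()/XLogRegisterBufData() calls the record really makes, which I have not counted from the patch.

    /* Reserve rdata slots for this record: a couple of fixed entries
     * (illustrative) plus one per index tuple being logged. */
    XLogEnsureRecordSpace(0, 2 + nitups);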
On Wed, Mar 8, 2017 at 3:38 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Mar 1, 2017 at 4:18 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Tue, Feb 28, 2017 at 7:31 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>> Yeah, actually those were added later in Enable-WAL-for-Hash* patch, >>> but I think as this patch is standalone, so we should not remove it >>> from their existing usage, I have added those back and rebased the >>> remaining patches. >> >> Great, thanks. 0001 looks good to me now, so committed. > > Committed 0002. > Thanks. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Mar 8, 2017 at 5:11 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > Here are some initial review thoughts on 0003 based on a first read-through. > > > + /* > + * we need to release and reacquire the lock on overflow buffer to ensure > + * that standby shouldn't see an intermediate state of it. > + */ > + LockBuffer(ovflbuf, BUFFER_LOCK_UNLOCK); > + LockBuffer(ovflbuf, BUFFER_LOCK_EXCLUSIVE); > > I still think this is a bad idea. Releasing and reacquiring the lock > on the master doesn't prevent the standby from seeing intermediate > states; the comment, if I understand correctly, is just plain wrong. > I can remove this release and reacquire stuff, but first, let me try to explain what intermediate state I am talking here. If we don't have a release/reacquire lock here, standby could see an empty overflow page at that location whereas master will be able to see the new page only after the insertion of tuple/'s in that bucket is finished. I understand that there is not much functionality impact of this change but still wanted to make you understand why I have added this code. I am okay with removing this code if you still feel this is just a marginal case and we should not bother about it. Let me know what you think? > > + * Initialise the freed overflow page, here we can't complete zeroed the > > Don't use British spelling, and don't use a comma to join two > sentences. Also "here we can't complete zeroed" doesn't seem right; I > don't know what that means. What the comment means to say is that we can't initialize page as zero (something like below) MemSet(ovflpage, 0, BufferGetPageSize(ovflbuf)); typo in sentence /complete/completely > > + * To replay meta page changes, we can log the entire metapage which > + * doesn't seem advisable considering size of hash metapage or we can > + * log the individual updated values which seems doable, but we prefer > + * to perform exact operations on metapage during replay as are done > + * during actual operation. That looks straight forward and has an > + * advantage of using lesser space in WAL. > > This sounds like it is describing two alternatives that were not > selected and then the one that was selected, but I don't understand > the difference between the second non-selected alternative and what > was actually picked. > Actually, the second alternative was picked. > I'd just write "We need only log the one field > from the metapage that has changed." or maybe drop the comment > altogether. > I have added the comment so that others can understand why we don't want to log the entire meta page (I think in some other indexes, we always log the entire meta page, so such a question can arise). However, after reading your comment, I think it is overkill and I prefer to drop the entire comment unless you feel differently. > > + /* > + * We need to release and if required reacquire the lock on > + * rbuf to ensure that standby shouldn't see an intermediate > + * state of it. If we don't release the lock, after replay of > + * XLOG_HASH_MOVE_PAGE_CONTENTS on standby users will > be able to > + * view the results of partial deletion on rblkno. > + */ > + LockBuffer(rbuf, BUFFER_LOCK_UNLOCK); > > If you DO release the lock, this will STILL be true, because what > matters on the standby is what the redo code does. > That's right, what I intend to do here is to release the lock in master where it will be released in standby. 
In this case, we can't ensure what user can see in master is same as in standby after this WAL record is replayed, because in master we have exclusive lock on write buf, so no one can read the contents of read buf (the location of read buf will be after write buf) whereas, in standby, it will be possible to read the contents of the read buf. I think this is not a correctness issue, so we might want to leave as it is, what do you say? > > + * We can log the exact changes made to meta page, however as no > + * concurrent operation could see the index during the replay of this > + * record, we can perform the operations during replay as they are > + * done here. > > Don't use a comma to join two sentences. Also, I don't understand > anything up to the "however". > I mean the below changes made to meta buf (refer existing code): metap->hashm_mapp[metap->hashm_nmaps] = num_buckets + 1; metap->hashm_nmaps++; I think it will be clear if you look both the DO and REDO operation of XLOG_HASH_INIT_BITMAP_PAGE. Let me know if you still think that the comment needs to be changed? > > + * Set pd_lower just past the end of the metadata. This is to log > + * full_page_image of metapage in xloginsert.c. > > Why the underscores? > > + * won't be a leak on standby, as next split will consume this space. > No specific reason, just trying to resemble full_page_writes, but can change if you feel it doesn't make much sense. > + * won't be a leak on standby, as next split will consume this space. > The master and the standby had better be the same, That's right. > so I don't know > what it means to talk about a leak on the standby. > I just want to say that even if we fail after allocation of buckets and before the split operation is completed, the newly allocated buckets will neither be leaked on master nor on standby. Do you think we should add a comment for same or is it too obvious? How about changing it to something like: "Even if we fail after this operation, it won't leak buckets, as next split will consume this space. In any case, even without failure, we don't use all the space in one split operation." > + * Change the shared buffer state in critical section, > + * otherwise any error could make it unrecoverable after > + * recovery. > > Uh, what? We only recover things during recovery. Nothing gets > recovered after recovery. > Right, how about "Change the shared buffer state in a critical section, otherwise any error could make the page unrecoverable." > Maybe this could get some tests, via the isolation tester, for things > like snapshot too old? > Okay, I can try, but note that currently there is no test related to "snapshot too old" for any other indexes. > I haven't gone through all of this in detail yet; I think most of it > is right on target. But there is a lot more to look at yet. > Thanks for your valuable comments. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Mar 8, 2017 at 7:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> I still think this is a bad idea. Releasing and reacquiring the lock >> on the master doesn't prevent the standby from seeing intermediate >> states; the comment, if I understand correctly, is just plain wrong. > I can remove this release and reacquire stuff, but first, let me try > to explain what intermediate state I am talking here. If we don't > have a release/reacquire lock here, standby could see an empty > overflow page at that location whereas master will be able to see the > new page only after the insertion of tuple/'s in that bucket is > finished. That's true, but you keep speaking as if the release and reacquire prevents the standby from seeing that state. It doesn't. Quite the opposite: it allows the master to see that intermediate state as well. I don't think there is a problem with the standby seeing an intermediate state that can't be seen on the master, provided we've set up the code to handle that intermediate state properly, which hopefully we have. >> + * Initialise the freed overflow page, here we can't complete zeroed the >> >> Don't use British spelling, and don't use a comma to join two >> sentences. Also "here we can't complete zeroed" doesn't seem right; I >> don't know what that means. > > What the comment means to say is that we can't initialize page as zero > (something like below) > MemSet(ovflpage, 0, BufferGetPageSize(ovflbuf)); > > typo in sentence > /complete/completely I think you can just say we initialize the page. Saying we can't initializing it by zeroing it seems pretty obvious, both from the code and from the general theory of how stuff works. If you did want to explain it, I think you'd write something like "Just zeroing the page won't work, because <reasons>." >> + /* >> + * We need to release and if required reacquire the lock on >> + * rbuf to ensure that standby shouldn't see an intermediate >> + * state of it. If we don't release the lock, after replay of >> + * XLOG_HASH_MOVE_PAGE_CONTENTS on standby users will >> be able to >> + * view the results of partial deletion on rblkno. >> + */ >> + LockBuffer(rbuf, BUFFER_LOCK_UNLOCK); >> >> If you DO release the lock, this will STILL be true, because what >> matters on the standby is what the redo code does. > > That's right, what I intend to do here is to release the lock in > master where it will be released in standby. In this case, we can't > ensure what user can see in master is same as in standby after this > WAL record is replayed, because in master we have exclusive lock on > write buf, so no one can read the contents of read buf (the location > of read buf will be after write buf) whereas, in standby, it will be > possible to read the contents of the read buf. I think this is not a > correctness issue, so we might want to leave as it is, what do you > say? Well, as in the case above, you're doing extra work to make sure that every state which can be observed on the standby can also be observed on the master. But that has no benefit, and it does have a cost: you have to release and reacquire a lock that you could otherwise retain, saving CPU cycles and code complexity. So I think the way you've got it is not good. >> + * We can log the exact changes made to meta page, however as no >> + * concurrent operation could see the index during the replay of this >> + * record, we can perform the operations during replay as they are >> + * done here. >> >> Don't use a comma to join two sentences. 
Also, I don't understand >> anything up to the "however". >> > > I mean the below changes made to meta buf (refer existing code): > metap->hashm_mapp[metap->hashm_nmaps] = num_buckets + 1; > > metap->hashm_nmaps++; > > I think it will be clear if you look both the DO and REDO operation of > XLOG_HASH_INIT_BITMAP_PAGE. Let me know if you still think that the > comment needs to be changed? I think it's not very clear. I think what this boils down to is that this is another place where you've got a comment to explain that you didn't log the entire metapage, but I don't think anyone would expect that anyway, so it doesn't seem like a very important thing to explain. If you want a comment here, maybe something like /* This is safe only because nobody else can be modifying the index at this stage; it's only visible to the transaction that is creating it */ >> + * Set pd_lower just past the end of the metadata. This is to log >> + * full_page_image of metapage in xloginsert.c. >> >> Why the underscores? >> >> + * won't be a leak on standby, as next split will consume this space. > > No specific reason, just trying to resemble full_page_writes, but can > change if you feel it doesn't make much sense. Well, usually I don't separate_words_in_a_sentence_with_underscores. It makes sense to do that if you are referring to an identifier in the code with that name, but otehrwise not. >> + * won't be a leak on standby, as next split will consume this space. > >> The master and the standby had better be the same, > > That's right. > >> so I don't know >> what it means to talk about a leak on the standby. >> > > I just want to say that even if we fail after allocation of buckets > and before the split operation is completed, the newly allocated > buckets will neither be leaked on master nor on standby. Do you think > we should add a comment for same or is it too obvious? How about > changing it to something like: > > "Even if we fail after this operation, it won't leak buckets, as next > split will consume this space. In any case, even without failure, we > don't use all the space in one split operation." Sure. >> + * Change the shared buffer state in critical section, >> + * otherwise any error could make it unrecoverable after >> + * recovery. >> >> Uh, what? We only recover things during recovery. Nothing gets >> recovered after recovery. > > Right, how about "Change the shared buffer state in a critical > section, otherwise any error could make the page unrecoverable." Sure. >> Maybe this could get some tests, via the isolation tester, for things >> like snapshot too old? > > Okay, I can try, but note that currently there is no test related to > "snapshot too old" for any other indexes. Wow, that's surprising. It seems the snapshot_too_old test only checks that this works for a table that has no indexes. Have you, anyway, tested it manually? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Mar 7, 2017 at 6:41 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> Great, thanks. 0001 looks good to me now, so committed. >> >> Committed 0002. > > Here are some initial review thoughts on 0003 based on a first read-through. More thoughts on the main patch: The text you've added to the hash index README throws around the term "single WAL entry" pretty freely. For example: "An insertion that causes an addition of an overflow page is logged as a single WAL entry preceded by a WAL entry for new overflow page required to insert a tuple." When you say a single WAL entry, you make it sound like there's only one, but because it's preceded by another WAL entry, there are actually two. I would rephase this to just avoid using the word "single" this way. For example: "If an insertion causes the addition of an overflow page, there will be one WAL entry for the new overflow page and a second entry for the insert itself." +mode anyway). It would seem natural to complete the split in VACUUM, but since +splitting a bucket might require allocating a new page, it might fail if you +run out of disk space. That would be bad during VACUUM - the reason for +running VACUUM in the first place might be that you run out of disk space, +and now VACUUM won't finish because you're out of disk space. In contrast, +an insertion can require enlarging the physical file anyway. But VACUUM actually does try to complete the splits, so this is wrong, I think. +Squeeze operation moves tuples from one of the buckets later in the chain to A squeeze operation +As Squeeze operation involves writing multiple atomic operations, it is As a squeeze operation involves multiple atomic operations + if (!xlrec.is_primary_bucket_page) + XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD); So, in hashbucketcleanup, the page is registered and therefore an FPI will be taken if this is the first reference to this page since the checkpoint, but hash_xlog_delete won't restore the FPI. I think that exposes us to a torn-page hazard. _hash_freeovflpage/hash_xlog_squeeze_page seems to have the same disease, as does _hash_squeezebucket/hash_xlog_move_page_contents. Not sure exactly what the fix should be. + /* + * As bitmap page doesn't have standard page layout, so this will + * allow us to log the data. + */ + XLogRegisterBuffer(2, mapbuf, REGBUF_STANDARD); + XLogRegisterBufData(2, (char *) &bitmap_page_bit, sizeof(uint32)); If the page doesn't have a standard page layout, why are we passing REGBUF_STANDARD? But I think you've hacked things so that bitmap pages DO have a standard page layout, per the change to _hash_initbitmapbuffer, in which case the comment seems wrong. (The comment is also a little confusing grammatically, but let's iron out the substantive issue first.) + recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_ADD_OVFL_PAGE); + + PageSetLSN(BufferGetPage(ovflbuf), recptr); + PageSetLSN(BufferGetPage(buf), recptr); + + if (BufferIsValid(mapbuf)) + PageSetLSN(BufferGetPage(mapbuf), recptr); + + if (BufferIsValid(newmapbuf)) + PageSetLSN(BufferGetPage(newmapbuf), recptr); I think you forgot to call PageSetLSN() on metabuf. (There are 5 block references in the write-ahead log record but only 4 calls to PageSetLSN ... surely that can't be right!) + /* + * We need to release the locks once the prev pointer of overflow bucket + * and next of left bucket are set, otherwise concurrent read might skip + * the bucket. 
+ */ + if (BufferIsValid(leftbuf)) + UnlockReleaseBuffer(leftbuf); + UnlockReleaseBuffer(ovflbuf); I don't believe the comment. Postponing the lock release wouldn't cause the concurrent read to skip the bucket. It might cause it to be delayed unnecessarily, but the following comment already addresses that point. I think you can just nuke this comment. + xlrec.is_prim_bucket_same_wrt = (wbuf == bucketbuf) ? true : false; + xlrec.is_prev_bucket_same_wrt = (wbuf == prevbuf) ? true : false; Maybe just xlrec.is_prim_bucket_same_wrt = (wbuf == bucketbuf); and similar? + /* + * We release locks on writebuf and bucketbuf at end of replay operation + * to ensure that we hold lock on primary bucket page till end of + * operation. We can optimize by releasing the lock on write buffer as + * soon as the operation for same is complete, if it is not same as + * primary bucket page, but that doesn't seem to be worth complicating the + * code. + */ + if (BufferIsValid(writebuf)) + UnlockReleaseBuffer(writebuf); + + if (BufferIsValid(bucketbuf)) + UnlockReleaseBuffer(bucketbuf); Here in hash_xlog_squeeze_page(), you wait until the very end to release these locks, but in e.g. hash_xlog_addovflpage you release them considerably earlier for reasons that seem like they would also be valid here. I think you could move this back to before the updates to the metapage and bitmap page. + /* + * Note: in normal operation, we'd update the metapage while still + * holding lock on the page we inserted into. But during replay it's + * not necessary to hold that lock, since no other index updates can + * be happening concurrently. + */ This comment in hash_xlog_init_bitmap_page seems to be off-base, because we didn't just insert into a page. I'm not disputing that it's safe, though: not only can nobody else be modifying it because we're in recovery, but also nobody else can see it yet; the creating transaction hasn't yet committed. + /* + * Note that we still update the page even if it was restored from a full + * page image, because the bucket flag is not included in the image. + */ Really? Why not? (Because it's part of the special space?) Doesn't hash_xlog_split_allocpage need to acquire cleanup locks to guard against concurrent scans? And also hash_xlog_split_complete? And hash_xlog_delete? If the regular code path is getting a cleanup lock to protect against concurrent scans, recovery better do it, too. + /* + * There is no harm in releasing the lock on old bucket before new bucket + * during replay as no other bucket splits can be happening concurrently. + */ + if (BufferIsValid(oldbuf)) + UnlockReleaseBuffer(oldbuf); And I'm not sure this is safe either. The issue isn't that there might be another bucket split happening concurrently, but that there might be scans that get confused by seeing the bucket split flag set before the new bucket is created or the metapage updated. Seems like it would be safer to keep all the locks for this one. + * Initialise the new bucket page, here we can't complete zeroed the page I think maybe a good way to write these would be "we can't simply zero the page". And Initialise -> Initialize. /* + * Change the shared buffer state in critical section, + * otherwise any error could make it unrecoverable after + * recovery. + */ + START_CRIT_SECTION(); + + /* * Insert tuple on new page, using _hash_pgaddtup to ensure * correctordering by hashkey. This is a tad inefficient * since we may have to shuffle itempointers repeatedly. 
* Possible future improvement: accumulate all the items for * the new page andqsort them before insertion. */ (void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup); + END_CRIT_SECTION(); No way. You have to start the critical section before making any page modifications and keep it alive until all changes have been logged. hash_redo's mappings of record types to function names is a little less regular than seems desirable. Generally, the function name is the lower-case version of the record type with "hash" and "xlog" switched. But XLOG_HASH_ADD_OVFL_PAGE and XLOG_HASH_SPLIT_ALLOCATE_PAGE break the pattern. It seems like a good test to do with this patch would be to set up a pgbench test on the master with a hash index replacing the usual btree index. Then, set up a standby and run a read-only pgbench on the standby while a read-write pgbench test runs on the master. Maybe you've already tried something like that? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
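As a point of reference for the critical-section complaint above, the ordering that src/backend/access/transam/README asks for looks roughly like the sketch below; the function name, the payload arguments, and the zero info code are placeholders, not the patch's actual record:

#include "postgres.h"
#include "access/rmgr.h"
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "utils/rel.h"

/*
 * Canonical WAL pattern: enter the critical section, modify the page,
 * mark the buffer dirty, emit the record, set the page LSN, and only
 * then leave the critical section.  No page modification may happen
 * before START_CRIT_SECTION(), and no modification may remain unlogged
 * when END_CRIT_SECTION() runs.
 */
static void
sketch_log_page_change(Relation rel, Buffer buf, char *payload, int payloadlen)
{
    Page        page = BufferGetPage(buf);

    START_CRIT_SECTION();

    /* ... apply the in-memory change to "page" here ... */
    MarkBufferDirty(buf);

    if (RelationNeedsWAL(rel))
    {
        XLogRecPtr  recptr;

        XLogBeginInsert();
        XLogRegisterData(payload, payloadlen);
        XLogRegisterBuffer(0, buf, REGBUF_STANDARD);

        recptr = XLogInsert(RM_HASH_ID, 0 /* placeholder info code */);
        PageSetLSN(page, recptr);
    }

    END_CRIT_SECTION();
}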
On Thu, Mar 9, 2017 at 3:11 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Mar 7, 2017 at 6:41 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>>> Great, thanks. 0001 looks good to me now, so committed. >>> >>> Committed 0002. >> >> Here are some initial review thoughts on 0003 based on a first read-through. > > More thoughts on the main patch: > > +mode anyway). It would seem natural to complete the split in VACUUM, but since > +splitting a bucket might require allocating a new page, it might fail if you > +run out of disk space. That would be bad during VACUUM - the reason for > +running VACUUM in the first place might be that you run out of disk space, > +and now VACUUM won't finish because you're out of disk space. In contrast, > +an insertion can require enlarging the physical file anyway. > > But VACUUM actually does try to complete the splits, so this is wrong, I think. > Vacuum just tries to clean the remaining tuples from old bucket. Only insert or split operation tries to complete the split. > > + if (!xlrec.is_primary_bucket_page) > + XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD); > > So, in hashbucketcleanup, the page is registered and therefore an FPI > will be taken if this is the first reference to this page since the > checkpoint, but hash_xlog_delete won't restore the FPI. I think that > exposes us to a torn-page hazard. > _hash_freeovflpage/hash_xlog_squeeze_page seems to have the same > disease, as does _hash_squeezebucket/hash_xlog_move_page_contents. > Right, if we use XLogReadBufferForRedoExtended() instead of XLogReadBufferExtended()/LockBufferForCleanup during relay routine, then it will restore FPI when required and can serve our purpose of acquiring clean up lock on primary bucket page. Yet another way could be to store some information like block number, relfilenode, forknum (maybe we can get away with some less info) of primary bucket in the fixed part of each of these records and then using that read the page and then take a cleanup lock. Now, as in most cases, this buffer needs to be registered only in cases when there are multiple overflow pages, I think having fixed information might hurt in some cases. > > + /* > + * As bitmap page doesn't have standard page layout, so this will > + * allow us to log the data. > + */ > + XLogRegisterBuffer(2, mapbuf, REGBUF_STANDARD); > + XLogRegisterBufData(2, (char *) &bitmap_page_bit, sizeof(uint32)); > > If the page doesn't have a standard page layout, why are we passing > REGBUF_STANDARD? But I think you've hacked things so that bitmap > pages DO have a standard page layout, per the change to > _hash_initbitmapbuffer, in which case the comment seems wrong. > Yes, the comment is obsolete and I have removed it. We need standard page layout for bitmap page, so that full page writes won't exclude the bitmappage data. > > + /* > + * Note that we still update the page even if it was restored from a full > + * page image, because the bucket flag is not included in the image. > + */ > > Really? Why not? (Because it's part of the special space?) > Yes, because it's part of special space. Do you want that to mentioned in comments? > Doesn't hash_xlog_split_allocpage need to acquire cleanup locks to > guard against concurrent scans? > I don't think so, because regular code takes it on old bucket to protect the split from concurrent inserts and cleanup lock on new bucket is just for the sake of consistency. > And also hash_xlog_split_complete? In regular code, we are not taking cleanup lock for this operation. 
> And hash_xlog_delete? If the regular code path is getting a cleanup > lock to protect against concurrent scans, recovery better do it, too. > We are already taking cleanup lock for this, not sure what exactly is missed here, can you be more specific? > + /* > + * There is no harm in releasing the lock on old bucket before new bucket > + * during replay as no other bucket splits can be happening concurrently. > + */ > + if (BufferIsValid(oldbuf)) > + UnlockReleaseBuffer(oldbuf); > > And I'm not sure this is safe either. The issue isn't that there > might be another bucket split happening concurrently, but that there > might be scans that get confused by seeing the bucket split flag set > before the new bucket is created or the metapage updated. > During scan we check for bucket being populated, so this should be safe. > Seems like > it would be safer to keep all the locks for this one. > If you feel better that way, I can change it and add a comment saying something like "we could release lock on bucket being split early as well, but doing here to be consistent with normal operation" > > It seems like a good test to do with this patch would be to set up a > pgbench test on the master with a hash index replacing the usual btree > index. Then, set up a standby and run a read-only pgbench on the > standby while a read-write pgbench test runs on the master. Maybe > you've already tried something like that? > Something similar has been done, but will again do with the final version. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
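To make the replay-side idea concrete: XLogReadBufferForRedoExtended() with get_cleanup_lock = true both restores a full-page image when one was attached to the record and takes the same cleanup lock the DO path held. A hedged sketch, where block id 0 and the function name are assumptions rather than the patch's code:

#include "postgres.h"
#include "access/xlogutils.h"
#include "storage/bufmgr.h"

static void
sketch_redo_with_cleanup_lock(XLogReaderState *record)
{
    Buffer          bucketbuf;
    XLogRedoAction  action;

    action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL,
                                           true /* get_cleanup_lock */,
                                           &bucketbuf);
    if (action == BLK_NEEDS_REDO)
    {
        /*
         * Re-apply the logged change to BufferGetPage(bucketbuf), then
         * MarkBufferDirty() and PageSetLSN() as usual; if the call
         * returned BLK_RESTORED, the full-page image already did the work.
         */
    }

    if (BufferIsValid(bucketbuf))
        UnlockReleaseBuffer(bucketbuf);
}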
On Thu, Mar 9, 2017 at 12:25 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Mar 8, 2017 at 7:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> Okay, I can try, but note that currently there is no test related to >> "snapshot too old" for any other indexes. > > Wow, that's surprising. It seems the snapshot_too_old test only > checks that this works for a table that has no indexes. Have you, > anyway, tested it manually? > Yes, I have tested it manually. I think we need to ensure that the modified tuple falls on the same page as the old tuple to make the test work. The slight difficulty with an index is ensuring that the modified tuple gets inserted into the same page as the old tuple, and that is even harder with hash indexes. Also, for the heap, I think the test relies on HOT pruning, whereas for an index we need to perform a manual vacuum. Basically, we can write a test for an index if we want, but I'm not sure it is worth the pain of writing one for hash indexes when there is no such test for btree. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Mar 9, 2017 at 10:23 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> +mode anyway). It would seem natural to complete the split in VACUUM, but since >> +splitting a bucket might require allocating a new page, it might fail if you >> +run out of disk space. That would be bad during VACUUM - the reason for >> +running VACUUM in the first place might be that you run out of disk space, >> +and now VACUUM won't finish because you're out of disk space. In contrast, >> +an insertion can require enlarging the physical file anyway. >> >> But VACUUM actually does try to complete the splits, so this is wrong, I think. > > Vacuum just tries to clean the remaining tuples from old bucket. Only > insert or split operation tries to complete the split. Oh, right. Sorry. >> + if (!xlrec.is_primary_bucket_page) >> + XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD); >> >> So, in hashbucketcleanup, the page is registered and therefore an FPI >> will be taken if this is the first reference to this page since the >> checkpoint, but hash_xlog_delete won't restore the FPI. I think that >> exposes us to a torn-page hazard. >> _hash_freeovflpage/hash_xlog_squeeze_page seems to have the same >> disease, as does _hash_squeezebucket/hash_xlog_move_page_contents. > > Right, if we use XLogReadBufferForRedoExtended() instead of > XLogReadBufferExtended()/LockBufferForCleanup during relay routine, > then it will restore FPI when required and can serve our purpose of > acquiring clean up lock on primary bucket page. Yet another way could > be to store some information like block number, relfilenode, forknum > (maybe we can get away with some less info) of primary bucket in the > fixed part of each of these records and then using that read the page > and then take a cleanup lock. Now, as in most cases, this buffer > needs to be registered only in cases when there are multiple overflow > pages, I think having fixed information might hurt in some cases. Just because the WAL record gets larger? I think you could arrange to record the data only in the cases where it is needed, but I'm also not sure it would matter that much anyway. Off-hand, it seems better than having to mark the primary bucket page dirty (because you have to set the LSN) every time any page in the chain is modified. >> + /* >> + * Note that we still update the page even if it was restored from a full >> + * page image, because the bucket flag is not included in the image. >> + */ >> >> Really? Why not? (Because it's part of the special space?) > > Yes, because it's part of special space. Do you want that to > mentioned in comments? Seems like a good idea. Maybe just change the end to "because the special space is not included in the image" to make it slightly more explicit. >> Doesn't hash_xlog_split_allocpage need to acquire cleanup locks to >> guard against concurrent scans? > > I don't think so, because regular code takes it on old bucket to > protect the split from concurrent inserts and cleanup lock on new > bucket is just for the sake of consistency. You might be right, but this definitely does create a state on the standby that can't occur on the master: the bucket being split is flagged as such, but the bucket being populated doesn't appear to exist yet. Maybe that's OK. It makes me slightly nervous to have the standby take a weaker lock than the master does, but perhaps it's harmless. >> And also hash_xlog_split_complete? > > In regular code, we are not taking cleanup lock for this operation. You're right, sorry. >> And hash_xlog_delete? 
If the regular code path is getting a cleanup >> lock to protect against concurrent scans, recovery better do it, too. >> > > We are already taking cleanup lock for this, not sure what exactly is > missed here, can you be more specific? Never mind, I'm wrong. >> + /* >> + * There is no harm in releasing the lock on old bucket before new bucket >> + * during replay as no other bucket splits can be happening concurrently. >> + */ >> + if (BufferIsValid(oldbuf)) >> + UnlockReleaseBuffer(oldbuf); >> >> And I'm not sure this is safe either. The issue isn't that there >> might be another bucket split happening concurrently, but that there >> might be scans that get confused by seeing the bucket split flag set >> before the new bucket is created or the metapage updated. > > During scan we check for bucket being populated, so this should be safe. Yeah, I guess so. This business of the standby taking weaker locks still seems scary to me, as in hash_xlog_split_allocpage above. >> Seems like >> it would be safer to keep all the locks for this one. > > If you feel better that way, I can change it and add a comment saying > something like "we could release lock on bucket being split early as > well, but doing here to be consistent with normal operation" I think that might not be a bad idea. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Mar 9, 2017 at 11:15 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Mar 9, 2017 at 10:23 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> Right, if we use XLogReadBufferForRedoExtended() instead of >> XLogReadBufferExtended()/LockBufferForCleanup during relay routine, >> then it will restore FPI when required and can serve our purpose of >> acquiring clean up lock on primary bucket page. Yet another way could >> be to store some information like block number, relfilenode, forknum >> (maybe we can get away with some less info) of primary bucket in the >> fixed part of each of these records and then using that read the page >> and then take a cleanup lock. Now, as in most cases, this buffer >> needs to be registered only in cases when there are multiple overflow >> pages, I think having fixed information might hurt in some cases. > > Just because the WAL record gets larger? I think you could arrange to > record the data only in the cases where it is needed, but I'm also not > sure it would matter that much anyway. Off-hand, it seems better than > having to mark the primary bucket page dirty (because you have to set > the LSN) every time any page in the chain is modified. > Do we really need to set LSN on this page (or mark it dirty), if so why? Are you worried about restoration of FPI or something else? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Mar 9, 2017 at 9:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > Do we really need to set LSN on this page (or mark it dirty), if so > why? Are you worried about restoration of FPI or something else? I haven't thought through all of the possible consequences and am a bit too tired to do so just now, but doesn't it seem rather risky to invent a whole new way of using these xlog functions? src/backend/access/transam/README describes how to do write-ahead logging properly, and neither MarkBufferDirty() nor PageSetLSN() is described as an optional step. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Mar 10, 2017 at 8:49 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Mar 9, 2017 at 9:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> Do we really need to set LSN on this page (or mark it dirty), if so >> why? Are you worried about restoration of FPI or something else? > > I haven't thought through all of the possible consequences and am a > bit to tired to do so just now, but doesn't it seem rather risky to > invent a whole new way of using these xlog functions? > src/backend/access/transam/README describes how to do write-ahead > logging properly, and neither MarkBufferDirty() nor PageSetLSN() is > described as an optional step. > Just to salvage my point, I think this is not the first place where we register buffer, but don't set lsn. For XLOG_HEAP2_VISIBLE, we register heap and vm buffers but set the LSN conditionally on heap buffer. Having said that, I see the value of your point and I am open to doing it that way if you feel that is a better way. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Mar 10, 2017 at 7:08 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, Mar 10, 2017 at 8:49 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Thu, Mar 9, 2017 at 9:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>> Do we really need to set LSN on this page (or mark it dirty), if so >>> why? Are you worried about restoration of FPI or something else? >> >> I haven't thought through all of the possible consequences and am a >> bit to tired to do so just now, but doesn't it seem rather risky to >> invent a whole new way of using these xlog functions? >> src/backend/access/transam/README describes how to do write-ahead >> logging properly, and neither MarkBufferDirty() nor PageSetLSN() is >> described as an optional step. > > Just to salvage my point, I think this is not the first place where we > register buffer, but don't set lsn. For XLOG_HEAP2_VISIBLE, we > register heap and vm buffers but set the LSN conditionally on heap > buffer. Having said that, I see the value of your point and I am open > to doing it that way if you feel that is a better way. Right, we did that, and it seems to have worked. But it was a scary exception that required a lot of thought. Now, since we did it once, we could do it again, but I am not sure it is for the best. In the case of vacuum, we knew that a vacuum on a large table could otherwise emit an FPI for every page, which would almost double the amount of write I/O generated by a vacuum - instead of WAL records + heap pages, you'd be writing FPIs + heap pages, a big increase. Now here I think that the amount of write I/O that it's costing us is not so clear. Unless it's going to be really big, I'd rather do this in some way we can think is definitely safe. Also, if we want to avoid dirtying the primary bucket page by setting the LSN, IMHO the way to do that is to store the block number of the primary bucket page without registering the buffer, and then during recovery, lock that block. That seems cleaner than hoping that we can take an FPI without setting the page LSN. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
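The alternative Robert sketches here, carrying the primary bucket's block number in the record without registering the buffer and locking that block during recovery, might look roughly like the following; the record struct, its fields, and the function name are hypothetical:

#include "postgres.h"
#include "access/xlogutils.h"
#include "storage/bufmgr.h"
#include "storage/relfilenode.h"

typedef struct sketch_xl_hash_delete     /* hypothetical record layout */
{
    RelFileNode node;            /* index the deletion applies to */
    BlockNumber prim_blkno;      /* primary bucket page to cleanup-lock */
    /* ... remainder of the record ... */
} sketch_xl_hash_delete;

static void
sketch_lock_primary_bucket(sketch_xl_hash_delete *xlrec)
{
    Buffer  bucketbuf;

    /* XLogReadBufferExtended returns the page pinned but unlocked */
    bucketbuf = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM,
                                       xlrec->prim_blkno, RBM_NORMAL);
    if (BufferIsValid(bucketbuf))
        LockBufferForCleanup(bucketbuf);

    /* ... replay the deletion on the registered overflow page(s) ... */

    if (BufferIsValid(bucketbuf))
        UnlockReleaseBuffer(bucketbuf);
}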
On Fri, Mar 10, 2017 at 6:21 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Mar 10, 2017 at 7:08 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> On Fri, Mar 10, 2017 at 8:49 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Thu, Mar 9, 2017 at 9:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>>> Do we really need to set LSN on this page (or mark it dirty), if so >>>> why? Are you worried about restoration of FPI or something else? >>> >>> I haven't thought through all of the possible consequences and am a >>> bit to tired to do so just now, but doesn't it seem rather risky to >>> invent a whole new way of using these xlog functions? >>> src/backend/access/transam/README describes how to do write-ahead >>> logging properly, and neither MarkBufferDirty() nor PageSetLSN() is >>> described as an optional step. >> >> Just to salvage my point, I think this is not the first place where we >> register buffer, but don't set lsn. For XLOG_HEAP2_VISIBLE, we >> register heap and vm buffers but set the LSN conditionally on heap >> buffer. Having said that, I see the value of your point and I am open >> to doing it that way if you feel that is a better way. > > Right, we did that, and it seems to have worked. But it was a scary > exception that required a lot of thought. Now, since we did it once, > we could do it again, but I am not sure it is for the best. In the > case of vacuum, we knew that a vacuum on a large table could otherwise > emit an FPI for every page, which would almost double the amount of > write I/O generated by a vacuum - instead of WAL records + heap pages, > you'd be writing FPIs + heap pages, a big increase. Now here I think > that the amount of write I/O that it's costing us is not so clear. > Unless it's going to be really big, I'd rather do this in some way we > can think is definitely safe. > > Also, if we want to avoid dirtying the primary bucket page by setting > the LSN, IMHO the way to do that is to store the block number of the > primary bucket page without registering the buffer, and then during > recovery, lock that block. That seems cleaner than hoping that we can > take an FPI without setting the page LSN. > I was thinking that we will use REGBUF_NO_IMAGE flag as is used in XLOG_HEAP2_VISIBLE record for heap buffer, that will avoid any extra I/O and will make it safe as well. I think that makes registering the buffer safe without setting LSN for XLOG_HEAP2_VISIBLE record. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
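On the DO side, the registration Amit is proposing would look something like the fragment below (it assumes XLogBeginInsert() has already been called; the block id and variable names simply echo the hunk quoted earlier in the thread):

    /*
     * Register the primary bucket page only so that the redo routine can
     * find it and take a cleanup lock; REGBUF_NO_IMAGE tells xloginsert.c
     * not to take a full-page image of it, which is why we would not then
     * be obliged to set its LSN.  This mirrors how the heap buffer is
     * registered for XLOG_HEAP2_VISIBLE.
     */
    if (!xlrec.is_primary_bucket_page)
        XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD | REGBUF_NO_IMAGE);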
On Fri, Mar 10, 2017 at 8:02 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > I was thinking that we will use REGBUF_NO_IMAGE flag as is used in > XLOG_HEAP2_VISIBLE record for heap buffer, that will avoid any extra > I/O and will make it safe as well. I think that makes registering the > buffer safe without setting LSN for XLOG_HEAP2_VISIBLE record. Hmm, yeah, that might work, though I would probably want to spend some time studying the patch and thinking about it before deciding for sure. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Mar 9, 2017 at 3:11 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Mar 7, 2017 at 6:41 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>>> Great, thanks. 0001 looks good to me now, so committed. >>> >>> Committed 0002. >> >> Here are some initial review thoughts on 0003 based on a first read-through. > > More thoughts on the main patch: > > /* > + * Change the shared buffer state in critical section, > + * otherwise any error could make it unrecoverable after > + * recovery. > + */ > + START_CRIT_SECTION(); > + > + /* > * Insert tuple on new page, using _hash_pgaddtup to ensure > * correct ordering by hashkey. This is a tad inefficient > * since we may have to shuffle itempointers repeatedly. > * Possible future improvement: accumulate all the items for > * the new page and qsort them before insertion. > */ > (void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup); > > + END_CRIT_SECTION(); > > No way. You have to start the critical section before making any page > modifications and keep it alive until all changes have been logged. > I think what we need to do here is to accumulate all the tuples that need to be added to new bucket page till either that page has no more space or there are no more tuples remaining in an old bucket. Then in a critical section, add them to the page using _hash_pgaddmultitup and log the entire new bucket page contents as is currently done in patch log_split_page(). Now, here we can choose to log the individual tuples as well instead of a complete page, however not sure if there is any benefit for doing the same because XLogRecordAssemble() will anyway remove the empty space from the page. Let me know if you have something else in mind. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Sat, Mar 11, 2017 at 12:20 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> /* >> + * Change the shared buffer state in critical section, >> + * otherwise any error could make it unrecoverable after >> + * recovery. >> + */ >> + START_CRIT_SECTION(); >> + >> + /* >> * Insert tuple on new page, using _hash_pgaddtup to ensure >> * correct ordering by hashkey. This is a tad inefficient >> * since we may have to shuffle itempointers repeatedly. >> * Possible future improvement: accumulate all the items for >> * the new page and qsort them before insertion. >> */ >> (void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup); >> >> + END_CRIT_SECTION(); >> >> No way. You have to start the critical section before making any page >> modifications and keep it alive until all changes have been logged. >> > > I think what we need to do here is to accumulate all the tuples that > need to be added to new bucket page till either that page has no more > space or there are no more tuples remaining in an old bucket. Then in > a critical section, add them to the page using _hash_pgaddmultitup and > log the entire new bucket page contents as is currently done in patch > log_split_page(). I agree. > Now, here we can choose to log the individual > tuples as well instead of a complete page, however not sure if there > is any benefit for doing the same because XLogRecordAssemble() will > anyway remove the empty space from the page. Let me know if you have > something else in mind. Well, if you have two pages that are 75% full, and you move a third of the tuples from one of them into the other, it's going to be about four times more efficient to log only the moved tuples than the whole page. But logging the whole filled page wouldn't be wrong, just somewhat inefficient. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
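Putting the agreed approach together, the split path would collect the tuples bound for the new bucket page outside the critical section, add them in one go, and log the filled page. A rough sketch, assuming a helper with the shape of _hash_pgaddmultitup() and a placeholder record id (the exact signatures here are assumptions, not the final patch):

#include "postgres.h"
#include "access/hash.h"
#include "access/rmgr.h"
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "utils/rel.h"

static void
sketch_flush_tuples_to_new_bucket(Relation rel, Buffer nbuf,
                                  IndexTuple *itups,
                                  OffsetNumber *itup_offsets,
                                  uint16 nitups)
{
    Page    npage = BufferGetPage(nbuf);

    /* caller has already verified that the accumulated tuples fit */
    START_CRIT_SECTION();

    _hash_pgaddmultitup(rel, nbuf, itups, itup_offsets, nitups);
    MarkBufferDirty(nbuf);

    if (RelationNeedsWAL(rel))
    {
        XLogRecPtr  recptr;

        XLogBeginInsert();
        /* log the whole filled page, as log_split_page() does */
        XLogRegisterBuffer(0, nbuf, REGBUF_FORCE_IMAGE | REGBUF_STANDARD);

        recptr = XLogInsert(RM_HASH_ID, 0 /* placeholder info code */);
        PageSetLSN(npage, recptr);
    }

    END_CRIT_SECTION();
}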
On Thu, Mar 9, 2017 at 3:11 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Mar 7, 2017 at 6:41 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>>> Great, thanks. 0001 looks good to me now, so committed. >>> >>> Committed 0002. >> >> Here are some initial review thoughts on 0003 based on a first read-through. > > More thoughts on the main patch: > > The text you've added to the hash index README throws around the term > "single WAL entry" pretty freely. For example: "An insertion that > causes an addition of an overflow page is logged as a single WAL entry > preceded by a WAL entry for new overflow page required to insert a > tuple." When you say a single WAL entry, you make it sound like > there's only one, but because it's preceded by another WAL entry, > there are actually two. I would rephase this to just avoid using the > word "single" this way. For example: "If an insertion causes the > addition of an overflow page, there will be one WAL entry for the new > overflow page and a second entry for the insert itself." > Okay, changed as per suggestion. I have also removed the second part of the above comment which states that newly allocated overflow page can be used by concurrent transactions. That won't be true after fixing release/reacquire stuff in _hash_addovflpage. I have also rephrased another similar usage of "single WAL entry". > +mode anyway). It would seem natural to complete the split in VACUUM, but since > +splitting a bucket might require allocating a new page, it might fail if you > +run out of disk space. That would be bad during VACUUM - the reason for > +running VACUUM in the first place might be that you run out of disk space, > +and now VACUUM won't finish because you're out of disk space. In contrast, > +an insertion can require enlarging the physical file anyway. > > But VACUUM actually does try to complete the splits, so this is wrong, I think. > As discussed up thread, there is no need to change. > +Squeeze operation moves tuples from one of the buckets later in the chain to > > A squeeze operation > > +As Squeeze operation involves writing multiple atomic operations, it is > > As a squeeze operation involves multiple atomic operations > Changed as per suggestion. > + if (!xlrec.is_primary_bucket_page) > + XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD); > > So, in hashbucketcleanup, the page is registered and therefore an FPI > will be taken if this is the first reference to this page since the > checkpoint, but hash_xlog_delete won't restore the FPI. I think that > exposes us to a torn-page hazard. > _hash_freeovflpage/hash_xlog_squeeze_page seems to have the same > disease, as does _hash_squeezebucket/hash_xlog_move_page_contents. > Not sure exactly what the fix should be. > I have used REGBUF_NO_IMAGE while registering buffer to avoid this hazard and during replay used XLogReadBufferForRedoExtended() to read block and take cleanup lock on it. > + /* > + * As bitmap page doesn't have standard page layout, so this will > + * allow us to log the data. > + */ > + XLogRegisterBuffer(2, mapbuf, REGBUF_STANDARD); > + XLogRegisterBufData(2, (char *) &bitmap_page_bit, sizeof(uint32)); > > If the page doesn't have a standard page layout, why are we passing > REGBUF_STANDARD? But I think you've hacked things so that bitmap > pages DO have a standard page layout, per the change to > _hash_initbitmapbuffer, in which case the comment seems wrong. > Yes, the comment is obsolete and I have removed it. 
We need standard page layout for bitmap page, so that full page writes won't exclude the bitmappage data. > > + recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_ADD_OVFL_PAGE); > + > + PageSetLSN(BufferGetPage(ovflbuf), recptr); > + PageSetLSN(BufferGetPage(buf), recptr); > + > + if (BufferIsValid(mapbuf)) > + PageSetLSN(BufferGetPage(mapbuf), recptr); > + > + if (BufferIsValid(newmapbuf)) > + PageSetLSN(BufferGetPage(newmapbuf), recptr); > > I think you forgot to call PageSetLSN() on metabuf. (There are 5 > block references in the write-ahead log record but only 4 calls to > PageSetLSN ... surely that can't be right!) > Right, fixed. > + /* > + * We need to release the locks once the prev pointer of overflow bucket > + * and next of left bucket are set, otherwise concurrent read might skip > + * the bucket. > + */ > + if (BufferIsValid(leftbuf)) > + UnlockReleaseBuffer(leftbuf); > + UnlockReleaseBuffer(ovflbuf); > > I don't believe the comment. Postponing the lock release wouldn't > cause the concurrent read to skip the bucket. It might cause it to be > delayed unnecessarily, but the following comment already addresses > that point. I think you can just nuke this comment. > Okay, removed the comment. > + xlrec.is_prim_bucket_same_wrt = (wbuf == bucketbuf) ? true : false; > + xlrec.is_prev_bucket_same_wrt = (wbuf == prevbuf) ? true : false; > > Maybe just xlrec.is_prim_bucket_same_wrt = (wbuf == bucketbuf); and similar? > Changed as per suggestion. > + /* > + * We release locks on writebuf and bucketbuf at end of replay operation > + * to ensure that we hold lock on primary bucket page till end of > + * operation. We can optimize by releasing the lock on write buffer as > + * soon as the operation for same is complete, if it is not same as > + * primary bucket page, but that doesn't seem to be worth complicating the > + * code. > + */ > + if (BufferIsValid(writebuf)) > + UnlockReleaseBuffer(writebuf); > + > + if (BufferIsValid(bucketbuf)) > + UnlockReleaseBuffer(bucketbuf); > > Here in hash_xlog_squeeze_page(), you wait until the very end to > release these locks, but in e.g. hash_xlog_addovflpage you release > them considerably earlier for reasons that seem like they would also > be valid here. I think you could move this back to before the updates > to the metapage and bitmap page. > Sure, doesn't see any problem with releasing locks early, so changed as per suggestion. > + /* > + * Note: in normal operation, we'd update the metapage while still > + * holding lock on the page we inserted into. But during replay it's > + * not necessary to hold that lock, since no other index updates can > + * be happening concurrently. > + */ > > This comment in hash_xlog_init_bitmap_page seems to be off-base, > because we didn't just insert into a page. I'm not disputing that > it's safe, though: not only can nobody else be modifying it because > we're in recovery, but also nobody else can see it yet; the creating > transaction hasn't yet committed. > Fixed the comment and mentioned the second reason as suggested by you (that sounds appropriate here). > + /* > + * Note that we still update the page even if it was restored from a full > + * page image, because the bucket flag is not included in the image. > + */ > > Really? Why not? (Because it's part of the special space?) > Added a comment as discussed above. > Doesn't hash_xlog_split_allocpage need to acquire cleanup locks to > guard against concurrent scans? Changed and made it similar to normal operations by taking cleanup lock. 
> + /* > + * There is no harm in releasing the lock on old bucket before new bucket > + * during replay as no other bucket splits can be happening concurrently. > + */ > + if (BufferIsValid(oldbuf)) > + UnlockReleaseBuffer(oldbuf); > > And I'm not sure this is safe either. The issue isn't that there > might be another bucket split happening concurrently, but that there > might be scans that get confused by seeing the bucket split flag set > before the new bucket is created or the metapage updated. Seems like > it would be safer to keep all the locks for this one. > Changed as discussed. > + * Initialise the new bucket page, here we can't complete zeroed the page > > I think maybe a good way to write these would be "we can't simply zero > the page". And Initialise -> Initialize. > I have changed the comment, see if that looks okay to you. > /* > + * Change the shared buffer state in critical section, > + * otherwise any error could make it unrecoverable after > + * recovery. > + */ > + START_CRIT_SECTION(); > + > + /* > * Insert tuple on new page, using _hash_pgaddtup to ensure > * correct ordering by hashkey. This is a tad inefficient > * since we may have to shuffle itempointers repeatedly. > * Possible future improvement: accumulate all the items for > * the new page and qsort them before insertion. > */ > (void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup); > > + END_CRIT_SECTION(); > > No way. You have to start the critical section before making any page > modifications and keep it alive until all changes have been logged. > Changed as discussed up thread. > hash_redo's mappings of record types to function names is a little > less regular than seems desirable. Generally, the function name is > the lower-case version of the record type with "hash" and "xlog" > switched. But XLOG_HASH_ADD_OVFL_PAGE and > XLOG_HASH_SPLIT_ALLOCATE_PAGE break the pattern. > Okay, changed as per suggestion and I went ahead and change the name of corresponding xlog record structures as well because those also match the function name pattern. > It seems like a good test to do with this patch would be to set up a > pgbench test on the master with a hash index replacing the usual btree > index. Then, set up a standby and run a read-only pgbench on the > standby while a read-write pgbench test runs on the master. Maybe > you've already tried something like that? > I also think so and apart from that I think it makes sense to perform recovery test by Jeff Janes tool and probably tests with wal_consistency_check. These tests are already running from past seven hours or so and I will keep them running for the whole night to see if there is any discrepancy. Thanks to Kuntal and Ashutosh for helping me in doing these stress tests. Note that I have fixed all other comments raised by you in another e-mail including adding the test for snapshot too old (I have prepared a separate patch for the test. I have added just one test to keep the timing short because each test will take minimum 6s to complete. This 6s requirement is what is required by existing tests or any other test for snapshot too old). Let me know if I have missed anything? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
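For what it's worth, with the renaming described above the redo dispatch would read roughly as below. This is an illustrative subset only: the renamed hash_xlog_* function names are assumed to follow the lower-cased record-type pattern, and a hash_xlog.h header declaring them is assumed rather than quoted from the committed tree.

#include "postgres.h"
#include "access/hash_xlog.h"
#include "access/xlogreader.h"

void
hash_redo(XLogReaderState *record)
{
    uint8   info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;

    switch (info)
    {
        case XLOG_HASH_INIT_BITMAP_PAGE:
            hash_xlog_init_bitmap_page(record);
            break;
        case XLOG_HASH_ADD_OVFL_PAGE:
            hash_xlog_add_ovfl_page(record);
            break;
        case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
            hash_xlog_split_allocate_page(record);
            break;
        /* ... one case per XLOG_HASH_* record type ... */
        default:
            elog(PANIC, "hash_redo: unknown op code %u", info);
    }
}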
On Sun, Mar 12, 2017 at 8:06 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Sat, Mar 11, 2017 at 12:20 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>> /* >>> + * Change the shared buffer state in critical section, >>> + * otherwise any error could make it unrecoverable after >>> + * recovery. >>> + */ >>> + START_CRIT_SECTION(); >>> + >>> + /* >>> * Insert tuple on new page, using _hash_pgaddtup to ensure >>> * correct ordering by hashkey. This is a tad inefficient >>> * since we may have to shuffle itempointers repeatedly. >>> * Possible future improvement: accumulate all the items for >>> * the new page and qsort them before insertion. >>> */ >>> (void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup); >>> >>> + END_CRIT_SECTION(); >>> >>> No way. You have to start the critical section before making any page >>> modifications and keep it alive until all changes have been logged. >>> >> >> I think what we need to do here is to accumulate all the tuples that >> need to be added to new bucket page till either that page has no more >> space or there are no more tuples remaining in an old bucket. Then in >> a critical section, add them to the page using _hash_pgaddmultitup and >> log the entire new bucket page contents as is currently done in patch >> log_split_page(). > > I agree. > Okay, I have changed like that in the patch. >> Now, here we can choose to log the individual >> tuples as well instead of a complete page, however not sure if there >> is any benefit for doing the same because XLogRecordAssemble() will >> anyway remove the empty space from the page. Let me know if you have >> something else in mind. > > Well, if you have two pages that are 75% full, and you move a third of > the tuples from one of them into the other, it's going to be about > four times more efficient to log only the moved tuples than the whole > page. > I don't see how this could happen during split? I mean if you are moving 25% tuples from old bucket page to a new bucket page which is 75% full, it will log the new bucket page only after it gets full. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Mar 13, 2017 at 6:26 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, Mar 9, 2017 at 3:11 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Tue, Mar 7, 2017 at 6:41 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>>>> Great, thanks. 0001 looks good to me now, so committed. >>>> >>>> Committed 0002. >>> >>> Here are some initial review thoughts on 0003 based on a first read-through. >> > >> It seems like a good test to do with this patch would be to set up a >> pgbench test on the master with a hash index replacing the usual btree >> index. Then, set up a standby and run a read-only pgbench on the >> standby while a read-write pgbench test runs on the master. Maybe >> you've already tried something like that? >> > > I also think so and apart from that I think it makes sense to perform > recovery test by Jeff Janes tool and probably tests with > wal_consistency_check. These tests are already running from past seven > hours or so and I will keep them running for the whole night to see if > there is any discrepancy. > We didn't find any issues with the above testing. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Mar 13, 2017 at 11:48 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > We didn't found any issue with the above testing. Great! I've committed the latest version of the patch, with some cosmetic changes. It would be astonishing if there weren't a bug or two left, but I think overall this is very solid work, and I think it's time to put this out there and see how things go. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > Great! I've committed the latest version of the patch, with some > cosmetic changes. Woo hoo! That's been a bee in the bonnet for, um, decades. regards, tom lane
On Tue, Mar 14, 2017 at 1:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Great! I've committed the latest version of the patch, with some >> cosmetic changes. > > Woo hoo! That's been a bee in the bonnet for, um, decades. Yeah. I'm pretty happy to see that go in. It's become pretty clear to me that there are a bunch of other things about hash indexes which are not exactly great, the worst of which is the way they grow by DOUBLING IN SIZE. Mithun has submitted a patch which - if it's acceptable - would alleviate that to some degree by splitting up each doubling into four separate increments. But that still could lead to very large growth increments on big indexes, like your 32GB index suddenly growing - in one fell swoop - to 40GB. That's admittedly a lot better than what will happen right now, which is instantaneous growth to 64GB, but it's not great, and it's not altogether clear how it could be fixed in a really satisfying way. Other things that are not so great: - no multi-column support - no amcanunique support - every insert dirties the metapage - splitting is generally too aggressive; very few overflow pages are ever created unless you have piles of duplicates Still, this is a big step forward. I think it will be useful for people with long text strings or other very wide data that they want to index, and hopefully we'll get some feedback from users about where this works well and where it doesn't. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > It's become pretty clear to me that there are a bunch of other things > about hash indexes which are not exactly great, the worst of which is > the way they grow by DOUBLING IN SIZE. Uh, what? Growth should happen one bucket-split at a time. > Other things that are not so great: > - no multi-column support > - no amcanunique support > - every insert dirties the metapage > - splitting is generally too aggressive; very few overflow pages are > ever created unless you have piles of duplicates Yeah. It's a bit hard to see how to add multi-column support unless you give up the property of allowing queries on a subset of the index columns. Lack of amcanunique seems like mostly a round-tuit shortage. The other two are implementation deficiencies that maybe we can remedy someday. Another thing I'd like to see is support for 64-bit hash values. But all of these were mainly blocked by people not wanting to sink effort into hash indexes as long as they were unusable for production due to lack of WAL support. So this is a huge step forward. regards, tom lane
On Tue, Mar 14, 2017 at 2:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> It's become pretty clear to me that there are a bunch of other things >> about hash indexes which are not exactly great, the worst of which is >> the way they grow by DOUBLING IN SIZE. > > Uh, what? Growth should happen one bucket-split at a time. Technically, the buckets are created one at a time, but because of the way hashm_spares works, the primary bucket pages for all bucket from 2^N to 2^{N+1}-1 must be physically consecutive. See _hash_alloc_buckets. >> Other things that are not so great: > >> - no multi-column support >> - no amcanunique support >> - every insert dirties the metapage >> - splitting is generally too aggressive; very few overflow pages are >> ever created unless you have piles of duplicates > > Yeah. It's a bit hard to see how to add multi-column support unless you > give up the property of allowing queries on a subset of the index columns. > Lack of amcanunique seems like mostly a round-tuit shortage. The other > two are implementation deficiencies that maybe we can remedy someday. > > Another thing I'd like to see is support for 64-bit hash values. > > But all of these were mainly blocked by people not wanting to sink effort > into hash indexes as long as they were unusable for production due to lack > of WAL support. So this is a huge step forward. Agreed, on all points. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
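To put a number on the doubling behavior, here is a tiny standalone illustration (plain C, not backend code) of the allocation pattern just described: each new splitpoint must lay out consecutive primary bucket pages for buckets 2^N through 2^{N+1}-1, i.e. as many pages as all earlier splitpoints combined, which is how a 32GB index can appear to jump toward 64GB in one step. The starting bucket count is an arbitrary assumption for the example.

#include <stdio.h>

int
main(void)
{
    unsigned long total = 4;        /* say, buckets 0..3 from CREATE INDEX */
    int     n;

    printf("start: %lu primary bucket pages\n", total);
    for (n = 2; n <= 10; n++)
    {
        unsigned long newly = 1UL << n;     /* buckets 2^n .. 2^(n+1)-1 */

        total += newly;
        printf("splitpoint %d reserves %lu more pages -> %lu total\n",
               n, newly, total);
    }
    return 0;
}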
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Mar 14, 2017 at 2:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> It's become pretty clear to me that there are a bunch of other things >>> about hash indexes which are not exactly great, the worst of which is >>> the way they grow by DOUBLING IN SIZE. >> Uh, what? Growth should happen one bucket-split at a time. > Technically, the buckets are created one at a time, but because of the > way hashm_spares works, the primary bucket pages for all bucket from > 2^N to 2^{N+1}-1 must be physically consecutive. See > _hash_alloc_buckets. Right, but we only fill those pages one at a time. It's true that as soon as we need another overflow page, that's going to get dropped beyond the 2^{N+1}-1 point, and the *apparent* size of the index will grow quite a lot. But any modern filesystem should handle that without much difficulty by treating the index as a sparse file. There may be some work to be done in places like pg_basebackup to recognize and deal with sparse files, but it doesn't seem like a reason to panic. regards, tom lane
* Tom Lane (tgl@sss.pgh.pa.us) wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > On Tue, Mar 14, 2017 at 2:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> Robert Haas <robertmhaas@gmail.com> writes: > >>> It's become pretty clear to me that there are a bunch of other things > >>> about hash indexes which are not exactly great, the worst of which is > >>> the way they grow by DOUBLING IN SIZE. > > >> Uh, what? Growth should happen one bucket-split at a time. > > > Technically, the buckets are created one at a time, but because of the > > way hashm_spares works, the primary bucket pages for all bucket from > > 2^N to 2^{N+1}-1 must be physically consecutive. See > > _hash_alloc_buckets. > > Right, but we only fill those pages one at a time. > > It's true that as soon as we need another overflow page, that's going to > get dropped beyond the 2^{N+1}-1 point, and the *apparent* size of the > index will grow quite a lot. But any modern filesystem should handle > that without much difficulty by treating the index as a sparse file. Uh, last I heard we didn't allow or want sparse files in the backend because then we have to handle a possible out-of-disk-space failure on every write. If we think they're ok to do, it'd be awful nice to figure out a way for VACUUM to turn an entirely-empty 1G chunk into a sparse file.. > There may be some work to be done in places like pg_basebackup to > recognize and deal with sparse files, but it doesn't seem like a > reason to panic. Well, and every file-based backup tool out there.. Thanks! Stephen
Stephen Frost <sfrost@snowman.net> writes: > * Tom Lane (tgl@sss.pgh.pa.us) wrote: >> It's true that as soon as we need another overflow page, that's going to >> get dropped beyond the 2^{N+1}-1 point, and the *apparent* size of the >> index will grow quite a lot. But any modern filesystem should handle >> that without much difficulty by treating the index as a sparse file. > Uh, last I heard we didn't allow or want sparse files in the backend > because then we have to handle a possible out-of-disk-space failure on > every write. For a hash index, this would happen during a bucket split, which would need to be resilient against out-of-disk-space anyway. >> There may be some work to be done in places like pg_basebackup to >> recognize and deal with sparse files, but it doesn't seem like a >> reason to panic. > Well, and every file-based backup tool out there.. Weren't you the one leading the charge to deprecate use of file-based backup? regards, tom lane
* Tom Lane (tgl@sss.pgh.pa.us) wrote: > Stephen Frost <sfrost@snowman.net> writes: > > * Tom Lane (tgl@sss.pgh.pa.us) wrote: > >> It's true that as soon as we need another overflow page, that's going to > >> get dropped beyond the 2^{N+1}-1 point, and the *apparent* size of the > >> index will grow quite a lot. But any modern filesystem should handle > >> that without much difficulty by treating the index as a sparse file. > > > Uh, last I heard we didn't allow or want sparse files in the backend > > because then we have to handle a possible out-of-disk-space failure on > > every write. > > For a hash index, this would happen during a bucket split, which would > need to be resilient against out-of-disk-space anyway. We wouldn't attempt to use the area of the file which is not yet allocated except when doing a bucket split? If that's the case then this does seem to at least be less of an issue, though I hope we put in appropriate comments about it. > >> There may be some work to be done in places like pg_basebackup to > >> recognize and deal with sparse files, but it doesn't seem like a > >> reason to panic. > > > Well, and every file-based backup tool out there.. > > Weren't you the one leading the charge to deprecate use of file-based > backup? No, nor do I see how we would ever be able to deprecate file-based backups. If anything, I'd like to see us improve our support for them. I'm certainly curious where the notion that I was ever in favor of deprecating them came from, particularly given all of the effort that David and I have been pouring into our favorite file-based backup tool over the past few years. Thanks! Stephen
On 15/03/17 06:29, Robert Haas wrote: > > Great! I've committed the latest version of the patch, with some > cosmetic changes. > > It would be astonishing if there weren't a bug or two left, but I > think overall this is very solid work, and I think it's time to put > this out there and see how things go. > Awesome, great work! Cheers Mark
On Tue, Mar 14, 2017 at 10:59 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Mar 13, 2017 at 11:48 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> We didn't found any issue with the above testing. > > Great! I've committed the latest version of the patch, with some > cosmetic changes. > Thanks a lot. I have noticed that the test case patch for "snapshot too old" is not committed. Do you want to leave that due to timing requirement or do you want me to submit it separately so that we can deal with it separately and may take an input from Kevin? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Mar 15, 2017 at 12:53 AM, Stephen Frost <sfrost@snowman.net> wrote: > * Tom Lane (tgl@sss.pgh.pa.us) wrote: >> Stephen Frost <sfrost@snowman.net> writes: >> > * Tom Lane (tgl@sss.pgh.pa.us) wrote: >> >> It's true that as soon as we need another overflow page, that's going to >> >> get dropped beyond the 2^{N+1}-1 point, and the *apparent* size of the >> >> index will grow quite a lot. But any modern filesystem should handle >> >> that without much difficulty by treating the index as a sparse file. >> >> > Uh, last I heard we didn't allow or want sparse files in the backend >> > because then we have to handle a possible out-of-disk-space failure on >> > every write. >> >> For a hash index, this would happen during a bucket split, which would >> need to be resilient against out-of-disk-space anyway. > > We wouldn't attempt to use the area of the file which is not yet > allocated except when doing a bucket split? > That's right. > If that's the case then > this does seem to at least be less of an issue, though I hope we put in > appropriate comments about it. > I think we have sufficient comments in code especially on top of function _hash_alloc_buckets(). -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Amit, * Amit Kapila (amit.kapila16@gmail.com) wrote: > On Wed, Mar 15, 2017 at 12:53 AM, Stephen Frost <sfrost@snowman.net> wrote: > > If that's the case then > > this does seem to at least be less of an issue, though I hope we put in > > appropriate comments about it. > > I think we have sufficient comments in code especially on top of > function _hash_alloc_buckets(). I don't see any comments regarding how we have to be sure to handle an out-of-space case properly in the middle of a file because we've made it sparse. I do see that mdwrite() should handle an out-of-disk-space case, though that just makes me wonder what's different here compared to normal relations, such that a sparse WAL'd hash index isn't a problem but a sparse normal relation can't be handled. Additional comments here would be overkill if we think that the lower layers are already all set up to handle such a case properly and we don't have to do anything special, but I had understood that we didn't generally think that sparse files would just work. I'm certainly happy to be wrong there because, if that's the case, it'd be a great way to address the problems we have returning large chunks of free space in the middle of a relation to the OS, but if there is a real concern here then we need to work out what it is and probably add comments or possibly add code to address whatever it is. Thanks! Stephen
On Wed, Mar 15, 2017 at 9:18 AM, Stephen Frost <sfrost@snowman.net> wrote: >> I think we have sufficient comments in code especially on top of >> function _hash_alloc_buckets(). > > I don't see any comments regarding how we have to be sure to handle > an out-of-space case properly in the middle of a file because we've made > it sparse. > > I do see that mdwrite() should handle an out-of-disk-space case, though > that just makes me wonder what's different here compared to normal > relations that we don't have an issue with a sparse WAL'd hash index but > we can't handle it if a normal relation is sparse. I agree. I think that what hash indexes are doing here is inconsistent with what we do for btree indexes and the heap. And I don't think it would be bad to fix that. We could, of course, go the other way and do what Tom is suggesting - insist that everybody's got to be prepared for sparse files, but I would view that as something of a U-turn. I think hash indexes are like this because nobody's really worried about hash indexes because they haven't been crash-safe or performant. Now that we've made them crash safe and are on the way to making them performant, fixing other things is, as Tom already said, a lot more interesting. Now, that having been said, I'm not sure it's a good idea to tinker with the behavior for v10. We could change the new-splitpoint code so that it loops over all the pages in the new splitpoint and zeroes them all, instead of just the last one. If we all agree that's how it should work, then it's probably not a lot of work to make the change. But if what's needed is anything more than that or we don't all agree, then we'd better just leave it alone and we can revisit it for v11. It's too late to start making significant or controversial design changes at this point, and you could argue that this is a preexisting design defect which the WAL-logging patch wasn't necessarily obligated to fix. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
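A sketch of the loop-and-zero alternative mentioned above, in case it helps the discussion; RelationOpenSmgr() and smgrextend() are the existing APIs, while the function name and the idea of zeroing every block of the new splitpoint (rather than only the last one) are the assumptions here:

#include "postgres.h"
#include "storage/smgr.h"
#include "utils/rel.h"

static void
sketch_zero_fill_splitpoint(Relation rel, BlockNumber firstblock,
                            BlockNumber nblocks)
{
    char       *zerobuf = (char *) palloc0(BLCKSZ);
    BlockNumber blkno;

    RelationOpenSmgr(rel);

    /*
     * Extend with explicit zero pages for every block of the new
     * splitpoint, so the file contains no hole and any out-of-disk-space
     * error surfaces right here rather than at some later write.
     */
    for (blkno = firstblock; blkno < firstblock + nblocks; blkno++)
        smgrextend(rel->rd_smgr, MAIN_FORKNUM, blkno, zerobuf, false);

    pfree(zerobuf);
}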
Robert Haas <robertmhaas@gmail.com> writes:
> Now, that having been said, I'm not sure it's a good idea to tinker
> with the behavior for v10.  We could change the new-splitpoint code so
> that it loops over all the pages in the new splitpoint and zeroes them
> all, instead of just the last one.

Why would we do that? That would change the behavior from something that's arguably OK (at least given the right filesystem) to something that's clearly not very OK.

> It's too late to start making significant or controversial design
> changes at this point, and you could argue that this is a preexisting
> design defect which the WAL-logging patch wasn't necessarily obligated
> to fix.

Yes, it is certainly that, and no, this patch wasn't chartered to fix it. I don't have a problem with leaving things like this for v10.

FWIW, I'm not certain that Stephen is correct to claim that we have some concrete problem with sparse files. We certainly don't *depend* on sparse storage anyplace else, nor write data in a way that would be likely to trigger it; but I'm not aware that we need to work hard to avoid it.

			regards, tom lane
Stephen Frost <sfrost@snowman.net> writes:
> I do see that mdwrite() should handle an out-of-disk-space case, though
> that just makes me wonder what's different here compared to normal
> relations that we don't have an issue with a sparse WAL'd hash index but
> we can't handle it if a normal relation is sparse.

*Any* write has to be prepared to handle errors. There's always a risk of EIO, and on a COW filesystem you might well get ENOSPC even when you think you're overwriting previously-allocated storage. All that we are doing by pre-allocating storage is reducing the risks a bit, not guaranteeing that no error will happen.

			regards, tom lane
Tom,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> FWIW, I'm not certain that Stephen is correct to claim that we have
> some concrete problem with sparse files.  We certainly don't *depend*
> on sparse storage anyplace else, nor write data in a way that would be
> likely to trigger it; but I'm not aware that we need to work hard to
> avoid it.

I don't have any concrete evidence that there is an issue, just a recollection of bringing up exactly this idea - turning entirely empty 1G segments into sparse files to free that space back to the filesystem during a VACUUM - in a discussion somewhere along the way, and being told that it'd be bad because we would run into issues later when we try to write those pages back out and discover we're out of disk space.

Naturally, there's a lot to consider with such a change to VACUUM, like how we would WAL that and how we'd make sure that nothing is about to try to use any of those pages (perhaps start at the end of the segment and lock the pages, or just try to get an exclusive lock on the table as we do when we try to truncate the relation because there's free space at the end?), but it'd be a very interesting project to consider, if we are willing to accept sparse heap files. Another interesting idea might be to somehow change the now-empty relation into just a place-holder that says "assume I'm 1G in size and empty", but actually have the file be shrunk.

Anyway, I tend to agree with the sentiment that we don't need this patch to change the behavior here. Perhaps it'll be good to see what happens when people start using these sparse hash indexes; maybe that will shed some light on whether we have anything to worry about here or not, and if not then we can consider having sparse files elsewhere.

Thanks!

Stephen
On Wed, Mar 15, 2017 at 10:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> FWIW, I'm not certain that Stephen is correct to claim that we have
> some concrete problem with sparse files.  We certainly don't *depend*
> on sparse storage anyplace else, nor write data in a way that would be
> likely to trigger it; but I'm not aware that we need to work hard to
> avoid it.

That theory seems inconsistent with how mdextend() works. My understanding is that we zero-fill the new blocks before populating them with actual data precisely to avoid running out of disk space due to deferred allocation at the OS level. If we don't care about failures due to deferred allocation at the OS level, we can rip that logic out and improve the performance of relation extension considerably. If we do care about failures due to deferred allocation, then leaving holes in the file is a bad idea.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
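(To make the contrast concrete, here is a standalone illustration in plain POSIX C - deliberately not PostgreSQL code, and with made-up function names - of the two extension strategies being compared. Zero-filling every new block, as mdextend()-style extension does, forces the filesystem to allocate storage immediately, so an out-of-space error surfaces at extension time; writing only the final block leaves a hole whose allocation, and any ENOSPC, is deferred until the skipped blocks are first written.)

    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    /* extend by nblocks, allocating storage eagerly (mdextend()-style) */
    int
    extend_eager(int fd, off_t curblocks, int nblocks)
    {
        char        zero[BLCKSZ];

        memset(zero, 0, sizeof(zero));
        for (int i = 0; i < nblocks; i++)
        {
            /* every block is written, so ENOSPC is reported here, up front */
            if (pwrite(fd, zero, BLCKSZ, (curblocks + i) * (off_t) BLCKSZ) != BLCKSZ)
                return -1;
        }
        return 0;
    }

    /* extend by nblocks, writing only the last block (sparse-style) */
    int
    extend_sparse(int fd, off_t curblocks, int nblocks)
    {
        char        zero[BLCKSZ];
        off_t       last = (curblocks + nblocks - 1) * (off_t) BLCKSZ;

        memset(zero, 0, sizeof(zero));
        if (pwrite(fd, zero, BLCKSZ, last) != BLCKSZ)
            return -1;
        /* the skipped blocks are a hole; they may still fail with ENOSPC later */
        return 0;
    }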
On Tue, Mar 14, 2017 at 10:30 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Mar 14, 2017 at 10:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Mon, Mar 13, 2017 at 11:48 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> We didn't found any issue with the above testing.
>>
>> Great!  I've committed the latest version of the patch, with some
>> cosmetic changes.
>>
>
> Thanks a lot.  I have noticed that the test case patch for "snapshot
> too old" is not committed.  Do you want to leave that due to timing
> requirement or do you want me to submit it separately so that we can
> deal with it separately and may take an input from Kevin?

I've committed that now as well. Let's see what the buildfarm thinks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Wed, Mar 15, 2017 at 10:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > FWIW, I'm not certain that Stephen is correct to claim that we have
> > some concrete problem with sparse files.  We certainly don't *depend*
> > on sparse storage anyplace else, nor write data in a way that would be
> > likely to trigger it; but I'm not aware that we need to work hard to
> > avoid it.
>
> That theory seems inconsistent with how mdextend() works.  My
> understanding is that we zero-fill the new blocks before populating
> them with actual data precisely to avoid running out of disk space due
> to deferred allocation at the OS level.  If we don't care about
> failures due to deferred allocation at the OS level, we can rip that
> logic out and improve the performance of relation extension
> considerably.  If we do care about failures due to deferred
> allocation, then leaving holes in the file is a bad idea.

That is a fantastic point.

Thanks!

Stephen
Robert Haas <robertmhaas@gmail.com> writes:
> That theory seems inconsistent with how mdextend() works.  My
> understanding is that we zero-fill the new blocks before populating
> them with actual data precisely to avoid running out of disk space due
> to deferred allocation at the OS level.  If we don't care about
> failures due to deferred allocation at the OS level, we can rip that
> logic out and improve the performance of relation extension
> considerably.

See my reply to Stephen. The fact that this fails to guarantee no ENOSPC on COW filesystems doesn't mean that it's not worth doing on other filesystems. We're reducing the risk, not eliminating it, but reducing risk is still a worthwhile activity.

			regards, tom lane
Tom,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > That theory seems inconsistent with how mdextend() works.  My
> > understanding is that we zero-fill the new blocks before populating
> > them with actual data precisely to avoid running out of disk space due
> > to deferred allocation at the OS level.  If we don't care about
> > failures due to deferred allocation at the OS level, we can rip that
> > logic out and improve the performance of relation extension
> > considerably.
>
> See my reply to Stephen.  The fact that this fails to guarantee no
> ENOSPC on COW filesystems doesn't mean that it's not worth doing on
> other filesystems.  We're reducing the risk, not eliminating it,
> but reducing risk is still a worthwhile activity.

Considering how much work we end up doing to extend a relation, and how we know that's been a hotspot, I'm not entirely sure I agree that avoiding the relatively infrequent out-of-disk-space concern when extending the relation (instead of letting it happen when we go to actually write data into the page) really is a good trade-off to make.

Thanks!

Stephen
On Wed, Mar 15, 2017 at 11:02 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> That theory seems inconsistent with how mdextend() works.  My
>> understanding is that we zero-fill the new blocks before populating
>> them with actual data precisely to avoid running out of disk space due
>> to deferred allocation at the OS level.  If we don't care about
>> failures due to deferred allocation at the OS level, we can rip that
>> logic out and improve the performance of relation extension
>> considerably.
>
> See my reply to Stephen.  The fact that this fails to guarantee no
> ENOSPC on COW filesystems doesn't mean that it's not worth doing on
> other filesystems.  We're reducing the risk, not eliminating it,
> but reducing risk is still a worthwhile activity.

Well, then it would presumably be worth reducing for hash indexes, too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Mar 15, 2017 at 7:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Mar 15, 2017 at 9:18 AM, Stephen Frost <sfrost@snowman.net> wrote:
>>> I think we have sufficient comments in code especially on top of
>>> function _hash_alloc_buckets().
>>
>> I don't see any comments regarding how we have to be sure to handle
>> an out-of-space case properly in the middle of a file because we've made
>> it sparse.
>>
>> I do see that mdwrite() should handle an out-of-disk-space case, though
>> that just makes me wonder what's different here compared to normal
>> relations that we don't have an issue with a sparse WAL'd hash index but
>> we can't handle it if a normal relation is sparse.
>
> I agree.  I think that what hash indexes are doing here is
> inconsistent with what we do for btree indexes and the heap.  And I
> don't think it would be bad to fix that.  We could, of course, go the
> other way and do what Tom is suggesting - insist that everybody's got
> to be prepared for sparse files, but I would view that as something of
> a U-turn.  I think hash indexes are like this because nobody's really
> worried about hash indexes because they haven't been crash-safe or
> performant.  Now that we've made them crash safe and are on the way to
> making them performant, fixing other things is, as Tom already said, a
> lot more interesting.
>
> Now, that having been said, I'm not sure it's a good idea to tinker
> with the behavior for v10.  We could change the new-splitpoint code so
> that it loops over all the pages in the new splitpoint and zeroes them
> all, instead of just the last one.
>

Yeah, but I am slightly afraid that, apart from consuming too much space, it will make the operation a lot slower when we have to perform the allocation step.

Another possibility is to do the allocation only when we actually use the bucket. As of now, _hash_getnewbuf assumes that the block is pre-existing and avoids the actual read/extend by using RBM_ZERO_AND_LOCK mode. I think we can change it so that it forces a new allocation (if the block doesn't exist) on need. I agree this is somewhat more change than what you have proposed to circumvent the sparse-file behavior, but it may not be a ton more work if we decide to fix this, and it has the advantage of allocating the space on disk only when it is actually needed.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
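(One possible shape of the alternative floated above, sketched very roughly: rather than assuming the block was pre-created, extend the index fork on demand when the requested block lies beyond its current physical end. The function name is made up, page initialization and error handling are omitted, and this is only one reading of the proposal, not a worked-out patch.)

    #include "postgres.h"
    #include "storage/bufmgr.h"
    #include "utils/rel.h"

    static Buffer
    sketch_getnewbuf_on_demand(Relation rel, BlockNumber blkno, ForkNumber forkNum)
    {
        BlockNumber nblocks = RelationGetNumberOfBlocksInFork(rel, forkNum);
        Buffer      buf;

        if (blkno < nblocks)
        {
            /* the block already exists physically: just zero and lock it */
            buf = ReadBufferExtended(rel, forkNum, blkno, RBM_ZERO_AND_LOCK, NULL);
        }
        else
        {
            /* physically extend the fork, block by block, until the wanted block exists */
            for (;;)
            {
                buf = ReadBufferExtended(rel, forkNum, P_NEW, RBM_NORMAL, NULL);
                if (BufferGetBlockNumber(buf) == blkno)
                    break;
                ReleaseBuffer(buf);     /* intermediate block: allocate and move on */
            }
            LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
        }

        /* page initialization, error handling, etc. omitted in this sketch */
        return buf;
    }

Compared with zero-filling the whole splitpoint at split time, this defers the physical allocation - and any out-of-space failure - to the insert that actually needs the bucket, which is the "allocate on actual need" property mentioned above.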