Re: [HACKERS] Write Ahead Logging for Hash Indexes - Mailing list pgsql-hackers

From Robert Haas
Subject Re: [HACKERS] Write Ahead Logging for Hash Indexes
Date
Msg-id CA+TgmoZva9Mr62A4Gqbz6uB2iUwafb8boMJ0FMpGP_3ThnwTWQ@mail.gmail.com
In response to Re: [HACKERS] Write Ahead Logging for Hash Indexes  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: [HACKERS] Write Ahead Logging for Hash Indexes  (Amit Kapila <amit.kapila16@gmail.com>)
Re: [HACKERS] Write Ahead Logging for Hash Indexes  (Amit Kapila <amit.kapila16@gmail.com>)
Re: [HACKERS] Write Ahead Logging for Hash Indexes  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Tue, Mar 7, 2017 at 6:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> Great, thanks.  0001 looks good to me now, so committed.
>>
>> Committed 0002.
>
> Here are some initial review thoughts on 0003 based on a first read-through.

More thoughts on the main patch:

The text you've added to the hash index README throws around the term
"single WAL entry" pretty freely.  For example: "An insertion that
causes an addition of an overflow page is logged as a single WAL entry
preceded by a WAL entry for new overflow page required to insert a
tuple."  When you say a single WAL entry, you make it sound like
there's only one, but because it's preceded by another WAL entry,
there are actually two.  I would rephrase this to just avoid using the
word "single" this way.  For example: "If an insertion causes the
addition of an overflow page, there will be one WAL entry for the new
overflow page and a second entry for the insert itself."

+mode anyway).  It would seem natural to complete the split in VACUUM, but since
+splitting a bucket might require allocating a new page, it might fail if you
+run out of disk space.  That would be bad during VACUUM - the reason for
+running VACUUM in the first place might be that you run out of disk space,
+and now VACUUM won't finish because you're out of disk space.  In contrast,
+an insertion can require enlarging the physical file anyway.

But VACUUM actually does try to complete the splits, so this is wrong, I think.

+Squeeze operation moves tuples from one of the buckets later in the chain to

A squeeze operation

+As Squeeze operation involves writing multiple atomic operations, it is

As a squeeze operation involves multiple atomic operations

+                if (!xlrec.is_primary_bucket_page)
+                    XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);

So, in hashbucketcleanup, the page is registered and therefore an FPI
will be taken if this is the first reference to this page since the
checkpoint, but hash_xlog_delete won't restore the FPI.  I think that
exposes us to a torn-page hazard.
_hash_freeovflpage/hash_xlog_squeeze_page seems to have the same
disease, as does _hash_squeezebucket/hash_xlog_move_page_contents.
Not sure exactly what the fix should be.
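
One direction that might work, though I haven't thought it through
(untested sketch, and the block ID is an assumption based on the
XLogRegisterBuffer() call above): have hash_xlog_delete pull the bucket
page through the regular redo machinery instead of merely locking it,
so that a full-page image gets restored when the record carries one:

    Buffer          bucketbuf = InvalidBuffer;

    /*
     * Restoring the image (rather than just taking the lock) closes the
     * torn-page window; there is nothing to replay on this page itself,
     * so BLK_NEEDS_REDO requires no action here.
     */
    if (XLogReadBufferForRedo(record, 0, &bucketbuf) == BLK_NEEDS_REDO)
    {
        /* registered only for the lock; no changes to apply */
    }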

+            /*
+             * As bitmap page doesn't have standard page layout, so this will
+             * allow us to log the data.
+             */
+            XLogRegisterBuffer(2, mapbuf, REGBUF_STANDARD);
+            XLogRegisterBufData(2, (char *) &bitmap_page_bit, sizeof(uint32));

If the page doesn't have a standard page layout, why are we passing
REGBUF_STANDARD?  But I think you've hacked things so that bitmap
pages DO have a standard page layout, per the change to
_hash_initbitmapbuffer, in which case the comment seems wrong.

(The comment is also a little confusing grammatically, but let's iron
out the substantive issue first.)
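
(For reference, the tail of _hash_initbitmapbuffer as I read it in the
patch is something like:

    /* set all of the bits to 1 */
    freep = HashPageGetBitmap(pg);
    MemSet(freep, 0xFF, bmsize);

    /*
     * Setting pd_lower just past the bitmap data gives the page a
     * standard layout as far as xlog.c is concerned, letting it skip
     * the hole between pd_lower and pd_upper.
     */
    ((PageHeader) pg)->pd_lower = ((char *) freep + bmsize) - (char *) pg;

which is what makes me think REGBUF_STANDARD is fine and the comment is
what needs to change.)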

+        recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_ADD_OVFL_PAGE);
+
+        PageSetLSN(BufferGetPage(ovflbuf), recptr);
+        PageSetLSN(BufferGetPage(buf), recptr);
+
+        if (BufferIsValid(mapbuf))
+            PageSetLSN(BufferGetPage(mapbuf), recptr);
+
+        if (BufferIsValid(newmapbuf))
+            PageSetLSN(BufferGetPage(newmapbuf), recptr);

I think you forgot to call PageSetLSN() on metabuf.  (There are 5
block references in the write-ahead log record but only 4 calls to
PageSetLSN ... surely that can't be right!)
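
That is, I'd expect a fifth call along the lines of:

    PageSetLSN(BufferGetPage(metabuf), recptr);

(possibly guarded by BufferIsValid(metabuf), if the metapage isn't
always registered here).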

+    /*
+     * We need to release the locks once the prev pointer of overflow bucket
+     * and next of left bucket are set, otherwise concurrent read might skip
+     * the bucket.
+     */
+    if (BufferIsValid(leftbuf))
+        UnlockReleaseBuffer(leftbuf);
+    UnlockReleaseBuffer(ovflbuf);

I don't believe the comment.  Postponing the lock release wouldn't
cause the concurrent read to skip the bucket.  It might cause it to be
delayed unnecessarily, but the following comment already addresses
that point.  I think you can just nuke this comment.

+        xlrec.is_prim_bucket_same_wrt = (wbuf == bucketbuf) ? true : false;
+        xlrec.is_prev_bucket_same_wrt = (wbuf == prevbuf) ? true : false;

Maybe just xlrec.is_prim_bucket_same_wrt = (wbuf == bucketbuf); and similar?

+    /*
+     * We release locks on writebuf and bucketbuf at end of replay operation
+     * to ensure that we hold lock on primary bucket page till end of
+     * operation.  We can optimize by releasing the lock on write buffer as
+     * soon as the operation for same is complete, if it is not same as
+     * primary bucket page, but that doesn't seem to be worth complicating the
+     * code.
+     */
+    if (BufferIsValid(writebuf))
+        UnlockReleaseBuffer(writebuf);
+
+    if (BufferIsValid(bucketbuf))
+        UnlockReleaseBuffer(bucketbuf);

Here in hash_xlog_squeeze_page(), you wait until the very end to
release these locks, but in e.g. hash_xlog_addovflpage you release
them considerably earlier for reasons that seem like they would also
be valid here.  I think you could move this back to before the updates
to the metapage and bitmap page.
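
In other words (sketch), the same shape hash_xlog_addovflpage uses:

    /* done with the write page and the primary bucket page */
    if (BufferIsValid(writebuf))
        UnlockReleaseBuffer(writebuf);

    if (BufferIsValid(bucketbuf))
        UnlockReleaseBuffer(bucketbuf);

    /* ... then replay the metapage and bitmap page updates ... */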

+        /*
+         * Note: in normal operation, we'd update the metapage while still
+         * holding lock on the page we inserted into.  But during replay it's
+         * not necessary to hold that lock, since no other index updates can
+         * be happening concurrently.
+         */

This comment in hash_xlog_init_bitmap_page seems to be off-base,
because we didn't just insert into a page.  I'm not disputing that
it's safe, though: not only can nobody else be modifying it because
we're in recovery, but also nobody else can see it yet; the creating
transaction hasn't yet committed.

+    /*
+     * Note that we still update the page even if it was restored from a full
+     * page image, because the bucket flag is not included in the image.
+     */

Really?  Why not?  (Because it's part of the special space?)

Doesn't hash_xlog_split_allocpage need to acquire cleanup locks to
guard against concurrent scans?  And also hash_xlog_split_complete?
And hash_xlog_delete?  If the regular code path is getting a cleanup
lock to protect against concurrent scans, recovery better do it, too.
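
On the redo side that presumably means the extended variant with
get_cleanup_lock = true, something like (sketch; the block ID is
whatever the bucket page was registered as):

    if (XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL,
                                      true,     /* get_cleanup_lock */
                                      &bucketbuf) == BLK_NEEDS_REDO)
    {
        /* ... apply the changes to the bucket page ... */
    }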

+    /*
+     * There is no harm in releasing the lock on old bucket before new bucket
+     * during replay as no other bucket splits can be happening concurrently.
+     */
+    if (BufferIsValid(oldbuf))
+        UnlockReleaseBuffer(oldbuf);

And I'm not sure this is safe either.  The issue isn't that there
might be another bucket split happening concurrently, but that there
might be scans that get confused by seeing the bucket split flag set
before the new bucket is created or the metapage updated.  Seems like
it would be safer to keep all the locks for this one.

+     * Initialise the new bucket page, here we can't complete zeroed the page

I think maybe a good way to write these would be "we can't simply zero
the page".  And Initialise -> Initialize.

                /*
+                 * Change the shared buffer state in critical section,
+                 * otherwise any error could make it unrecoverable after
+                 * recovery.
+                 */
+                START_CRIT_SECTION();
+
+                /*
+                 * Insert tuple on new page, using _hash_pgaddtup to ensure
+                 * correct ordering by hashkey.  This is a tad inefficient
+                 * since we may have to shuffle itempointers repeatedly.
+                 * Possible future improvement: accumulate all the items for
+                 * the new page and qsort them before insertion.
+                 */
+                (void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
+
+                END_CRIT_SECTION();

No way.  You have to start the critical section before making any page
modifications and keep it alive until all changes have been logged.
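
For reference, the shape has to be, schematically (the record type and
registered data here are placeholders rather than what this particular
call site needs):

    START_CRIT_SECTION();

    /* make ALL the page modifications first ... */
    (void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
    MarkBufferDirty(nbuf);

    /* ... and WAL-log them while still inside the critical section */
    if (RelationNeedsWAL(rel))
    {
        XLogRecPtr      recptr;

        XLogBeginInsert();
        XLogRegisterBuffer(0, nbuf, REGBUF_STANDARD);
        XLogRegisterBufData(0, (char *) new_itup, itemsz);

        recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INSERT);
        PageSetLSN(BufferGetPage(nbuf), recptr);
    }

    END_CRIT_SECTION();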

hash_redo's mapping of record types to function names is a little
less regular than seems desirable.  Generally, the function name is
the lower-case version of the record type with "hash" and "xlog"
switched.  But XLOG_HASH_ADD_OVFL_PAGE and
XLOG_HASH_SPLIT_ALLOCATE_PAGE break the pattern.
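
That is, going by the naming used elsewhere in the patch, I'd expect:

    case XLOG_HASH_ADD_OVFL_PAGE:
        hash_xlog_add_ovfl_page(record);        /* patch has hash_xlog_addovflpage */
        break;
    case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
        hash_xlog_split_allocate_page(record);  /* patch has hash_xlog_split_allocpage */
        break;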

It seems like a good test to do with this patch would be to set up a
pgbench test on the master with a hash index replacing the usual btree
index.  Then, set up a standby and run a read-only pgbench on the
standby while a read-write pgbench test runs on the master.  Maybe
you've already tried something like that?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


