Re: [HACKERS] Write Ahead Logging for Hash Indexes - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: [HACKERS] Write Ahead Logging for Hash Indexes
Date
Msg-id CAA4eK1LwZk0MaK0LncZz6hsqcHnamfJB47nA8xuwRp4CyLFKzw@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] Write Ahead Logging for Hash Indexes  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: [HACKERS] Write Ahead Logging for Hash Indexes  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Thu, Mar 9, 2017 at 3:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Mar 7, 2017 at 6:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>> Great, thanks.  0001 looks good to me now, so committed.
>>>
>>> Committed 0002.
>>
>> Here are some initial review thoughts on 0003 based on a first read-through.
>
> More thoughts on the main patch:
>
> +mode anyway).  It would seem natural to complete the split in VACUUM, but since
> +splitting a bucket might require allocating a new page, it might fail if you
> +run out of disk space.  That would be bad during VACUUM - the reason for
> +running VACUUM in the first place might be that you run out of disk space,
> +and now VACUUM won't finish because you're out of disk space.  In contrast,
> +an insertion can require enlarging the physical file anyway.
>
> But VACUUM actually does try to complete the splits, so this is wrong, I think.
>

Vacuum just tries to clean the remaining tuples from old bucket.  Only
insert or split operation tries to complete the split.

>
> +                if (!xlrec.is_primary_bucket_page)
> +                    XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
>
> So, in hashbucketcleanup, the page is registered and therefore an FPI
> will be taken if this is the first reference to this page since the
> checkpoint, but hash_xlog_delete won't restore the FPI.  I think that
> exposes us to a torn-page hazard.
> _hash_freeovflpage/hash_xlog_squeeze_page seems to have the same
> disease, as does _hash_squeezebucket/hash_xlog_move_page_contents.
>

Right, if we use XLogReadBufferForRedoExtended() instead of
XLogReadBufferExtended()/LockBufferForCleanup during relay routine,
then it will restore FPI when required and can serve our purpose of
acquiring clean up lock on primary bucket page.  Yet another way could
be to store some information like block number, relfilenode, forknum
(maybe we can get away with some less info) of primary bucket in the
fixed part of each of these records and then using that read the page
and then take a cleanup lock.   Now, as in most cases, this buffer
needs to be registered only in cases when there are multiple overflow
pages, I think having fixed information might hurt in some cases.

>
> +            /*
> +             * As bitmap page doesn't have standard page layout, so this will
> +             * allow us to log the data.
> +             */
> +            XLogRegisterBuffer(2, mapbuf, REGBUF_STANDARD);
> +            XLogRegisterBufData(2, (char *) &bitmap_page_bit, sizeof(uint32));
>
> If the page doesn't have a standard page layout, why are we passing
> REGBUF_STANDARD?  But I think you've hacked things so that bitmap
> pages DO have a standard page layout, per the change to
> _hash_initbitmapbuffer, in which case the comment seems wrong.
>

Yes, the comment is obsolete and I have removed it.  We need standard
page layout for bitmap page, so that full page writes won't exclude
the bitmappage data.

>
> +    /*
> +     * Note that we still update the page even if it was restored from a full
> +     * page image, because the bucket flag is not included in the image.
> +     */
>
> Really?  Why not?  (Because it's part of the special space?)
>

Yes, because it's part of special space.  Do you want that to
mentioned in comments?

> Doesn't hash_xlog_split_allocpage need to acquire cleanup locks to
> guard against concurrent scans?
>

I don't think so, because regular code takes it on old bucket to
protect the split from concurrent inserts and cleanup lock on new
bucket is just for the sake of consistency.

> And also hash_xlog_split_complete?

In regular code, we are not taking cleanup lock for this operation.

> And hash_xlog_delete?  If the regular code path is getting a cleanup
> lock to protect against concurrent scans, recovery better do it, too.
>

We are already taking cleanup lock for this, not sure what exactly is
missed here, can you be more specific?

> +    /*
> +     * There is no harm in releasing the lock on old bucket before new bucket
> +     * during replay as no other bucket splits can be happening concurrently.
> +     */
> +    if (BufferIsValid(oldbuf))
> +        UnlockReleaseBuffer(oldbuf);
>
> And I'm not sure this is safe either.  The issue isn't that there
> might be another bucket split happening concurrently, but that there
> might be scans that get confused by seeing the bucket split flag set
> before the new bucket is created or the metapage updated.
>

During scan we check for bucket being populated, so this should be safe.

>  Seems like
> it would be safer to keep all the locks for this one.
>

If you feel better that way, I can change it and add a comment saying
something like "we could release lock on bucket being split early as
well, but doing here to be consistent with normal operation"

>
> It seems like a good test to do with this patch would be to set up a
> pgbench test on the master with a hash index replacing the usual btree
> index.  Then, set up a standby and run a read-only pgbench on the
> standby while a read-write pgbench test runs on the master.  Maybe
> you've already tried something like that?
>

Something similar has been done, but will again do with the final version.



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Stephen Frost
Date:
Subject: Re: [HACKERS] partitioned tables and contrib/sepgsql
Next
From: Fujii Masao
Date:
Subject: Re: [HACKERS] [bug fix] dblink leaks unnamed connections