Re: [HACKERS] Write Ahead Logging for Hash Indexes - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: [HACKERS] Write Ahead Logging for Hash Indexes
Msg-id: CA+TgmoaFEfkQuK3-4cJemOMnNPZLaZvNMF11d+qEyRSFebV6Lg@mail.gmail.com
In response to: Re: [HACKERS] Write Ahead Logging for Hash Indexes (Amit Kapila <amit.kapila16@gmail.com>)
Responses: Re: [HACKERS] Write Ahead Logging for Hash Indexes (Amit Kapila <amit.kapila16@gmail.com>)
List: pgsql-hackers
On Thu, Mar 9, 2017 at 10:23 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> +mode anyway).  It would seem natural to complete the split in VACUUM, but since
>> +splitting a bucket might require allocating a new page, it might fail if you
>> +run out of disk space.  That would be bad during VACUUM - the reason for
>> +running VACUUM in the first place might be that you run out of disk space,
>> +and now VACUUM won't finish because you're out of disk space.  In contrast,
>> +an insertion can require enlarging the physical file anyway.
>>
>> But VACUUM actually does try to complete the splits, so this is wrong, I think.
>
> Vacuum just tries to clean the remaining tuples from the old bucket.  Only
> an insert or split operation tries to complete the split.

Oh, right.  Sorry.

>> +                if (!xlrec.is_primary_bucket_page)
>> +                    XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
>>
>> So, in hashbucketcleanup, the page is registered and therefore an FPI
>> will be taken if this is the first reference to this page since the
>> checkpoint, but hash_xlog_delete won't restore the FPI.  I think that
>> exposes us to a torn-page hazard.
>> _hash_freeovflpage/hash_xlog_squeeze_page seems to have the same
>> disease, as does _hash_squeezebucket/hash_xlog_move_page_contents.
>
> Right, if we use XLogReadBufferForRedoExtended() instead of
> XLogReadBufferExtended()/LockBufferForCleanup in the replay routine,
> then it will restore the FPI when required and can serve our purpose of
> acquiring a cleanup lock on the primary bucket page.  Yet another way
> could be to store some information like block number, relfilenode, and
> forknum (maybe we can get away with less info) of the primary bucket in
> the fixed part of each of these records, and then use that to read the
> page and take a cleanup lock.  Now, since in most cases this buffer
> needs to be registered only when there are multiple overflow pages, I
> think having that fixed information might hurt in some cases.

Just because the WAL record gets larger?  I think you could arrange to
record the data only in the cases where it is needed, but I'm also not
sure it would matter that much anyway.  Off-hand, it seems better than
having to mark the primary bucket page dirty (because you have to set
the LSN) every time any page in the chain is modified.
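
Just to be sure we're imagining the same fix, here's roughly what I'd
expect the delete replay to look like with your first idea -- an
untested sketch against the existing redo APIs, with the block
numbering following my reading of the patch:

static void
hash_xlog_delete(XLogReaderState *record)
{
    XLogRecPtr  lsn = record->EndRecPtr;
    Buffer      bucketbuf = InvalidBuffer;
    Buffer      deletebuf = InvalidBuffer;

    /*
     * Block 0 is the primary bucket page, registered only when the
     * deletion happens on an overflow page.  Passing get_cleanup_lock =
     * true makes XLogReadBufferForRedoExtended() restore the FPI when
     * one is present and take the cleanup lock, mirroring what
     * hashbucketcleanup() does in normal operation.  There is nothing
     * to replay on that page itself.
     */
    if (XLogRecHasBlockRef(record, 0))
        (void) XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL,
                                             true, &bucketbuf);

    /* Block 1 is the page the tuples are actually deleted from. */
    if (XLogReadBufferForRedo(record, 1, &deletebuf) == BLK_NEEDS_REDO)
    {
        /* ... remove the dead tuples here ... */
        PageSetLSN(BufferGetPage(deletebuf), lsn);
        MarkBufferDirty(deletebuf);
    }

    if (BufferIsValid(deletebuf))
        UnlockReleaseBuffer(deletebuf);
    if (BufferIsValid(bucketbuf))
        UnlockReleaseBuffer(bucketbuf);
}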

>> +    /*
>> +     * Note that we still update the page even if it was restored from a full
>> +     * page image, because the bucket flag is not included in the image.
>> +     */
>>
>> Really?  Why not?  (Because it's part of the special space?)
>
> Yes, because it's part of the special space.  Do you want that to be
> mentioned in the comments?

Seems like a good idea.  Maybe just change the end to "because the
special space is not included in the image" to make it slightly more
explicit.
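
For concreteness, the whole comment would then read:

/*
 * Note that we still update the page even if it was restored from a full
 * page image, because the special space is not included in the image.
 */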

>> Doesn't hash_xlog_split_allocpage need to acquire cleanup locks to
>> guard against concurrent scans?
>
> I don't think so, because the regular code takes it on the old bucket to
> protect the split from concurrent inserts, and the cleanup lock on the
> new bucket is just for the sake of consistency.

You might be right, but this definitely does create a state on the
standby that can't occur on the master: the bucket being split is
flagged as such, but the bucket being populated doesn't appear to
exist yet.  Maybe that's OK.  It makes me slightly nervous to have the
standby take a weaker lock than the master does, but perhaps it's
harmless.

>> And also hash_xlog_split_complete?
>
> In the regular code, we are not taking a cleanup lock for this operation.

You're right, sorry.

>> And hash_xlog_delete?  If the regular code path is getting a cleanup
>> lock to protect against concurrent scans, recovery better do it, too.
>>
>
> We are already taking a cleanup lock for this; I'm not sure what exactly
> is missed here.  Can you be more specific?

Never mind, I'm wrong.

>> +    /*
>> +     * There is no harm in releasing the lock on old bucket before new bucket
>> +     * during replay as no other bucket splits can be happening concurrently.
>> +     */
>> +    if (BufferIsValid(oldbuf))
>> +        UnlockReleaseBuffer(oldbuf);
>>
>> And I'm not sure this is safe either.  The issue isn't that there
>> might be another bucket split happening concurrently, but that there
>> might be scans that get confused by seeing the bucket split flag set
>> before the new bucket is created or the metapage updated.
>
> During a scan we check whether the bucket is being populated, so this
> should be safe.

Yeah, I guess so.  This business of the standby taking weaker locks
still seems scary to me, as in hash_xlog_split_allocpage above.
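
(For the archives: the guard I take you to be referring to is the one
in _hash_first, which I'm paraphrasing from memory here -- rel, bucket,
buf, and opaque as in that function -- so the details may be off:)

/* after reaching what the metapage says is our bucket */
if (H_BUCKET_BEING_POPULATED(opaque))
{
    BlockNumber old_blkno;
    Buffer      old_buf;

    /*
     * The new bucket is still being filled from the old one, so its
     * contents are incomplete; scan the old bucket instead, skipping
     * tuples that do not hash to the new bucket.
     */
    old_blkno = _hash_get_oldblock_from_newbucket(rel, bucket);

    LockBuffer(buf, BUFFER_LOCK_UNLOCK);
    old_buf = _hash_getbuf(rel, old_blkno, HASH_READ, LH_BUCKET_PAGE);
    /* ... continue the scan over the old bucket's pages ... */
}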

>>  Seems like
>> it would be safer to keep all the locks for this one.
>
> If you feel better that way, I can change it and add a comment saying
> something like "we could release the lock on the bucket being split
> early as well, but we keep it to be consistent with normal operation".

I think that might not be a bad idea.
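
In other words, at the end of that replay routine the release order
would just become something like this (assuming the new bucket's buffer
is called newbuf there):

/* ... replay the changes to the new bucket and the metapage first ... */

/*
 * We could release the lock on the bucket being split early as well,
 * but we keep it until here to be consistent with normal operation.
 */
if (BufferIsValid(newbuf))
    UnlockReleaseBuffer(newbuf);
if (BufferIsValid(oldbuf))
    UnlockReleaseBuffer(oldbuf);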

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


