Re: Online checksums verification in the backend - Mailing list pgsql-hackers

From: Julien Rouhaud
Subject: Re: Online checksums verification in the backend
Msg-id: CAOBaU_Yh3Dmwp9fMB0RAOGU47JL4w+mWhK8jTXqzJA-_AewrBw@mail.gmail.com
In response to: Re: Online checksums verification in the backend (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers

Hi,

On Fri, Oct 30, 2020 at 2:17 AM Andres Freund <andres@anarazel.de> wrote:
> The code does IO while holding the buffer mapping lock. That seems
> *entirely* unacceptable to me. That basically locks 1/128 of shared
> buffers against concurrent mapping changes, while reading data that is
> likely not to be on disk?  Seriously?

The initial implementation had a different approach: read the buffer once
without holding the buffer mapping lock (which could lead to false
positives in some unlikely scenarios), and only if a corruption is
detected, read it once again *while holding the buffer mapping lock* to
ensure it's not a false positive.  Some benchmarking showed that the
performance was worse, so we dropped that optimisation.  Should we go back
to something like that, or do you have a better way to ensure a consistent
read of a buffer that isn't in shared buffers?
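
For reference, a rough sketch of that initial approach.  This is only a
sketch: check_one_block() is a hypothetical helper that reads the block
from disk and verifies its checksum, returning true if it looks sane.

    #include "postgres.h"
    #include "storage/buf_internals.h"
    #include "storage/smgr.h"

    static bool
    check_block_twice(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno)
    {
        BufferTag   tag;
        uint32      hash;
        LWLock     *partLock;
        bool        ok;

        /* First pass, no lock held: cheap, but can raise false positives. */
        if (check_one_block(reln, forknum, blkno))
            return true;

        /* Second pass: re-read under the buffer mapping partition lock. */
        INIT_BUFFERTAG(tag, reln->smgr_rnode.node, forknum, blkno);
        hash = BufTableHashCode(&tag);
        partLock = BufMappingPartitionLock(hash);

        LWLockAcquire(partLock, LW_SHARED);
        ok = check_one_block(reln, forknum, blkno);
        LWLockRelease(partLock);

        return ok;
    }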

> The justification in the in-shared-buffers case seems to completely
> mis-judge costs too:
>                  * Found it.  Now, retrieve its state to know what to do with it, and
>                  * release the pin immediately.  We do so to limit overhead as much as
>                  * possible.  We keep the shared LWLock on the target buffer mapping
>                  * partition for now, so this buffer cannot be evicted, and we acquire
>                  * an I/O Lock on the buffer as we may need to read its contents from
>                  * disk.
> a pin is cheap. Holding the partition lock is not.

I clearly did a poor job in that case.  Will fix.
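
To make sure I get the intended fix right: pin the buffer while still
holding the mapping partition lock, release the lock immediately, and rely
on the pin alone to prevent eviction?  A minimal sketch, assuming this
lives in bufmgr.c where PinBuffer()/UnpinBuffer() are visible:

    /* Callers must do this before their first PinBuffer() */
    ResourceOwnerEnlargeBuffers(CurrentResourceOwner);

    LWLockAcquire(partLock, LW_SHARED);
    buf_id = BufTableLookup(&tag, hash);
    if (buf_id >= 0)
    {
        BufferDesc *bufdesc = GetBufferDescriptor(buf_id);

        /* A pin is cheap and prevents eviction on its own. */
        valid = PinBuffer(bufdesc, NULL);
        /* Stop blocking this buffer mapping partition. */
        LWLockRelease(partLock);

        /* ... checksum verification on the pinned buffer ... */

        UnpinBuffer(bufdesc, true);
    }
    else
        LWLockRelease(partLock);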

> Also, using char[BLCKSZ] as a buffer isn't ok. This should use
> PGAlignedBlock:

I wasn't aware of it; I will fix.
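
For the archives, I understand the fix to be along these lines, with
smgr/forknum/blkno being whatever the caller already has at hand:

    PGAlignedBlock buf;     /* union from c.h, guarantees I/O-safe alignment */

    smgrread(smgr, forknum, blkno, buf.data);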

> >               LWLockAcquire(BufferDescriptorGetIOLock(bufdesc), LW_SHARED);
> >               buf_state = LockBufHdr(bufdesc);
> >               UnlockBufHdr(bufdesc, buf_state);
> >
> >               /* If the page is dirty or invalid, skip it */
> >               if ((buf_state & BM_DIRTY) != 0 || (buf_state & BM_TAG_VALID) == 0)
>
> This is weird as well. What is this supposed to do? Just locking and
> unlocking a buffer header doesn't do squat? There's no guarantee that
> the flags haven't changed by this point, so you could just as well not
> acquire the buffer header lock.

This is using the same approach as e.g. WaitIO() to get the state.  I agree
that the state can change after the buffer header lock has been released,
but I think that's out of scope here.  The only guarantee we can give is
that the database (or the subset of relations checked) was healthy at the
time the check was started, provided that your cluster survives the
checkpoint happening after the check ended.  I don't see how we can do
better than that.

> Also, why are pages without a valid tag ignored? I can follow the
> argument for skipping it in the DIRTY case, but that doesn't apply for
> BM_TAG_VALID?

AFAICT, pages that aren't BM_TAG_VALID are newly allocated pages.  Those
shouldn't be fully initialized yet, and they'll eventually be written out
and flushed.

> Also, uh, I don't think the locking of the buffer table provides you
> with the full guarantees CheckBuffer() seems to assume:
>
>  * Check the state of a buffer without loading it into the shared buffers. To
>  * avoid torn pages and possible false positives when reading data, a shared
>  * LWLock is taken on the target buffer pool partition mapping, and we check
>  * if the page is in shared buffers or not.  An I/O lock is taken on the block
>  * to prevent any concurrent activity from happening.
>
> this doesn't actually prevent all concurrent write IO, unless you hold
> an appropriate lock on the relation. There's a few places that use
> smgrwrite()/smgrextend() to write out data bypassing shared buffers.
>
> Maybe that isn't a problem for the uses of CheckBuffer() is envisioned
> for, but that'd need a pretty detailed explanation as to when it's safe
> to use CheckBuffer() for which blocks.

AFAICT, concurrent smgrwrite() calls can only happen for the init fork of
an unlogged relation, during its creation.  Those relations shouldn't be
visible to the caller's snapshot, so it should be safe.  I can add a
comment about that if I'm not mistaken.

For concurrent smgrextend(), we read the relation size at the beginning of
the function, so we shouldn't read newly allocated blocks.  But you're
right that it's still possible to get a size that includes a newly
allocated block that can be concurrently written.  We can avoid that by
holding a LOCKTAG_RELATION_EXTEND lock when reading the relation size.
Would that be ok?
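
Sketch of what I have in mind, assuming CheckBuffer()'s caller still has
the Relation at hand:

    #include "storage/bufmgr.h"
    #include "storage/lmgr.h"

    BlockNumber nblocks;

    /*
     * Serialize against concurrent smgrextend(): nothing can be appended
     * while we hold the extension lock, so any block counted in nblocks
     * has been fully extended.
     */
    LockRelationForExtension(relation, ExclusiveLock);
    nblocks = RelationGetNumberOfBlocks(relation);
    UnlockRelationForExtension(relation, ExclusiveLock);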


