Re: Online checksums verification in the backend - Mailing list pgsql-hackers
From | Julien Rouhaud
Subject | Re: Online checksums verification in the backend
Msg-id | CAOBaU_Yh3Dmwp9fMB0RAOGU47JL4w+mWhK8jTXqzJA-_AewrBw@mail.gmail.com
In response to | Re: Online checksums verification in the backend (Andres Freund <andres@anarazel.de>)
Responses | Re: Online checksums verification in the backend
List | pgsql-hackers
Hi,

On Fri, Oct 30, 2020 at 2:17 AM Andres Freund <andres@anarazel.de> wrote:
> The code does IO while holding the buffer mapping lock. That seems
> *entirely* unacceptable to me. That basically locks 1/128 of shared
> buffers against concurrent mapping changes, while reading data that is
> likely not to be on disk? Seriously?

The initial implementation had a different approach: it read the buffer once without holding the buffer mapping lock (which could lead to false positives in some unlikely scenarios), and only if a corruption was detected was the read done again *while holding the buffer mapping lock*, to make sure it wasn't a false positive. Benchmarking showed that this performed worse, so we dropped that optimisation. Should we go back to something like that, or do you have a better way to ensure a consistent read of a buffer that isn't in shared buffers?

> a pin is cheap. Holding the partition lock is not.
> The justification in the in-shared-buffers case seems to completely
> mis-judge costs too:
> * Found it. Now, retrieve its state to know what to do with it, and
> * release the pin immediately. We do so to limit overhead as much as
> * possible. We keep the shared LWLock on the target buffer mapping
> * partition for now, so this buffer cannot be evicted, and we acquire
> * an I/O Lock on the buffer as we may need to read its contents from
> * disk.
> a pin is cheap. Holding the partition lock is not.

I clearly did a poor job in that case. Will fix.

> Also, using char[BLCKSZ] as a buffer isn't ok. This should use
> PGAlignedBlock:

I wasn't aware of it, I will fix.

> > LWLockAcquire(BufferDescriptorGetIOLock(bufdesc), LW_SHARED);
> > buf_state = LockBufHdr(bufdesc);
> > UnlockBufHdr(bufdesc, buf_state);
> >
> > /* If the page is dirty or invalid, skip it */
> > if ((buf_state & BM_DIRTY) != 0 || (buf_state & BM_TAG_VALID) == 0)
>
> This is weird as well. What is this supposed to do?
> Just locking and unlocking a buffer header doesn't do squat? There's no
> guarantee that the flags haven't changed by this point, so you could
> just as well not acquire the buffer header lock.

This uses the same approach as e.g. WaitIO() to get the state. I agree that the state can change after the buffer header lock has been released, but I think that's out of scope. The only guarantee we can give is that the database (or the subset of relations checked) was healthy at the time the check was started, provided that your cluster survives the checkpoint happening after the check ended. I don't see how we can do better than that.

> Also, why are pages without a valid tag ignored? I can follow the
> argument for skipping it in the DIRTY case, but that doesn't apply for
> BM_TAG_VALID?

AFAICT pages that aren't BM_TAG_VALID are newly allocated pages. Those shouldn't be entirely initialized yet, and they'll eventually be written and flushed.

> Also, uh, I don't think the locking of the buffer table provides you
> with the full guarantees CheckBuffer() seems to assume:
>
> * Check the state of a buffer without loading it into the shared buffers. To
> * avoid torn pages and possible false positives when reading data, a shared
> * LWLock is taken on the target buffer pool partition mapping, and we check
> * if the page is in shared buffers or not. An I/O lock is taken on the block
> * to prevent any concurrent activity from happening.
>
> this doesn't actually prevent all concurrent write IO, unless you hold
> an appropriate lock on the relation. There's a few places that use
> smgrwrite()/smgrextend() to write out data bypassing shared buffers.
>
> Maybe that isn't a problem for the uses CheckBuffer() is envisioned
> for, but that'd need a pretty detailed explanation as to when it's safe
> to use CheckBuffer() for which blocks.

AFAICT, concurrent smgrwrite() calls can only happen for the init fork of an unlogged relation, during its creation.
Those relations shouldn't be visible to the caller's snapshot, so it should be safe. I can add a comment about that if I'm not mistaken. For concurrent smgrextend(), we read the relation size at the beginning of the function, so we shouldn't read newly allocated blocks. But you're right that it's still possible to get a size that includes a newly allocated block which is being concurrently written. We could avoid that by holding a LOCKTAG_RELATION_EXTEND lock while reading the relation size. Would that be ok?