Re: Online checksums verification in the backend - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: Online checksums verification in the backend |
Date | |
Msg-id | 20201030025834.6spoioldj3jiqif3@alap3.anarazel.de Whole thread Raw |
In response to | Re: Online checksums verification in the backend (Julien Rouhaud <rjuju123@gmail.com>) |
Responses |
Re: Online checksums verification in the backend
|
List | pgsql-hackers |
Hi, On 2020-10-30 10:01:08 +0800, Julien Rouhaud wrote: > On Fri, Oct 30, 2020 at 2:17 AM Andres Freund <andres@anarazel.de> wrote: > > The code does IO while holding the buffer mapping lock. That seems > > *entirely* unacceptable to me. That basically locks 1/128 of shared > > buffers against concurrent mapping changes, while reading data that is > > likely not to be on disk? Seriously? > > The initial implementation had a different approach, reading the buffer once > without holding the buffer mapping lock (which could lead to some false > positive in some unlikely scenario), and only if a corruption is detected the > read is done once again *while holding the buffer mapping lock* to ensure it's > not a false positive. Some benchmarking showed that the performance was worse, > so we dropped that optimisation. Should we go back to something like that or > do you have a better way to ensure a consistent read of a buffer which isn't in > shared buffers? I suspect that you're gonna need something quite different than what the function is doing right now. Not because such a method will be faster in isolation, but because there's a chance to have it correct and not have a significant performance impact onto the rest of the system. I've not thought about it in detail yet. Is suspect you'll need to ensure there is a valid entry in the buffer mapping table for the buffer you're processing. By virtue of setting BM_IO_IN_PROGRESS on that entry you're going to prevent concurrent IO from starting until your part is done. > > Also, why are pages without a valid tag ignored? I can follow the > > argument for skipping it in the DIRTY case, but that doesn't apply for > > BM_TAG_VALID? > > AFAICT pages that aren't BM_TAG_VALID are pages newly allocated. > Those shouldn't > be entirely initialized yet, and they'll be eventually written and flushed. When a page is being read there's a period when the buffer is without BM_TAG_VALID. It's quite possible that the locking prevents this case from being reachable - but in that case you shouldn't just accept it as something to be skipped... > > Also, uh, I don't think the locking of the buffer table provides you > > with the full guarantees CheckBuffer() seems to assume: > > > > * Check the state of a buffer without loading it into the shared buffers. To > > * avoid torn pages and possible false positives when reading data, a shared > > * LWLock is taken on the target buffer pool partition mapping, and we check > > * if the page is in shared buffers or not. An I/O lock is taken on the block > > * to prevent any concurrent activity from happening. > > > > this doesn't actually prevent all concurrent write IO, unless you hold > > an appropriate lock on the relation. There's a few places that use > > smgrwrite()/smgrextend() to write out data bypassing shared buffers. > > > > Maybe that isn't a problem for the uses of CheckBuffer() is envisioned > > for, but that'd need a pretty detailed explanation as to when it's safe > > to use CheckBuffer() for which blocks. > > AFAICT, concurrent smgrwrite() can only happen for init forks of unlogged > relation, during creation. That may be the case right in core right now, but for one, there definitely are extensions going through smgrwrite() without using the buffer pool. Essentially, what you are saying is that the introduction of CheckBuffer() altered what smgrwrite() is allowed to be used for, without having discussed or documented that. Before this an AM/extension could just use smgrwrite() to write data not in shared buffers, as long as a locking scheme is used that prevents multiple backends from doing that at the same time (trivially: AccessExclusiveLock). > Those relations shouldn't be visible to the caller > snapshot, so it should be safe. I can add a comment for that if I'm not > mistaken. There's no comment warning that you shouldn't use CheckBuffer() to check every buffer in shared buffers, or every relfilenode on disk. The latter would be quite a reasonable thing, given it'd avoid needing to connect to every database etc. > For concurrent smgrextend(), we read the relation size at the beginning of the > function, so we shouldn't read newly allocated blocks. But you're right that > it's still possible to get the size that includes a newly allocated block > that can be concurrently written. We can avoid that be holding a > LOCKTAG_RELATION_EXTEND lock when reading the relation size. Would that be ok? That could possibly work - but currently CheckBuffer() doesn't get a relation, nor are the comments explaining that it has to be a relation in the current database or anything. I hadn't yet looked at the caller - I just started looking at CheckBuffer() this because it caused compilation failures after rebasing my aio branch onto master (there's no IO locks anymore). Looking at the caller: - This is not a concurrency safe pattern: /* Check if relation exists. leaving if there is no such relation */ if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid))) return; relation = relation_open(relid, AccessShareLock); there's a pretty obvious time-to-check-time-to-use danger here. - pg_relation_check_pages()'s docs say "valid enough to safely be loaded into the server's shared buffers". I think that's overpromising by a lot. It sounds like it verifies that the page cannot cause a crash or such when accessed - but it obviously does no such thing. - Why does check_one_relation() *silently* ignore when it's being passed a temporary table, or a relkind without storage? - I don't think it's good that check_one_relation() releases relation locks after access, but I know that others think that's fine (I think it's only fine for catalog relations). - I realize permission to pg_relation_check_pages() is not granted to non-superusers by default, but shouldn't it still perform relation access checks? - why does check_relation_fork() pstrdup the path? Greetings, Andres Freund
pgsql-hackers by date: