Re: Online verification of checksums - Mailing list pgsql-hackers
From:           Tomas Vondra
Subject:        Re: Online verification of checksums
Date:
Msg-id:         795c567a-936f-c765-8c9d-4a4427dbd243@2ndquadrant.com
In response to: Re: Online verification of checksums (Michael Banck <michael.banck@credativ.de>)
List:           pgsql-hackers
On 09/17/2018 04:04 PM, Michael Banck wrote:
> Hi,
>
> On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote:
>> The patch is mostly copying the verification / retry logic from
>> basebackup.c, but I think it omitted a rather important detail that
>> makes it incorrect in the presence of concurrent writes.
>>
>> The very first thing basebackup does is this:
>>
>>     startptr = do_pg_start_backup(...);
>>
>> i.e. it waits for a checkpoint, remembering the LSN. And then when
>> checking a page it does this:
>>
>>     if (!PageIsNew(page) && PageGetLSN(page) < startptr)
>>     {
>>         ... verify the page checksum
>>     }
>>
>> Obviously, pg_verify_checksums can't do that easily because it's
>> supposed to run from outside the database instance.
>
> It reads pg_control anyway, so couldn't we just take
> ControlFile->checkPoint?
>
> Other than that, basebackup.c seems to only look at pages which haven't
> been changed since the backup starting checkpoint (see above if
> statement). That's reasonable for backups, but is it just as reasonable
> for online verification?

I suppose we might refresh the checkpoint LSN regularly, and use the
most recent one. On large/busy databases that would allow checking a
larger part of the database.

>> But the startptr detail is pretty important because it supports this
>> retry reasoning:
>>
>>     /*
>>      * Retry the block on the first failure.  It's
>>      * possible that we read the first 4K page of the
>>      * block just before postgres updated the entire block
>>      * so it ends up looking torn to us.  We only need to
>>      * retry once because the LSN should be updated to
>>      * something we can ignore on the next pass.  If the
>>      * error happens again then it is a true validation
>>      * failure.
>>      */
>>
>> Imagine the 8kB page as two 4kB pages, with the initial state being
>> [A1,A2] and another process over-writing it with [B1,B2]. If you read
>> the 8kB page, what states can you see?
>>
>> I don't think POSIX provides any guarantees about atomicity of the
>> write calls (and even if it does, the filesystems on Linux don't seem
>> to). So you may observe either [A1,B2] or [A2,B1], or various
>> inconsistent mixes of the two versions, depending on timing. Well,
>> torn pages ...
>>
>> Pretty much the only thing you can rely on is that when one process
>> does
>>
>>     write([B1,B2])
>>
>> the other process may first read [A1,B2], but the next read will
>> return [B1,B2] (or possibly newer data, if there was another write).
>> It will not read the "stale" A1 again.
>>
>> The basebackup relies on this kinda implicitly - on the retry it'll
>> notice the LSN changed (thanks to the startptr check), and the page
>> will be skipped entirely. This is pretty important, because the new
>> page might be torn in some other way.
>>
>> The pg_verify_checksums patch apparently ignores this skip logic,
>> because on the retry it simply re-reads the page again, verifies the
>> checksum and reports an error. Which is broken, because the newly
>> read page might be torn again due to a concurrent write.
>
> Well, ok.
>
>> So IMHO this should do something similar to basebackup - check the
>> page LSN, and if it changed then skip the page.
>>
>> I'm afraid this requires using the last checkpoint LSN, the way
>> startptr is used in basebackup. In particular we can't simply
>> remember the LSN from the first read, because we might actually read
>> [B1,A2] on the first try, and then [B1,B2] or [B1,C2] on the retry.
>> (Actually, the page may be torn in various other ways, not
>> necessarily at the 4kB boundary - it might be torn right after the
>> LSN, for example.)
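To make this concrete, a minimal sketch of the guard I have in mind,
using the last checkpoint LSN read from pg_control (the same
ControlFile->checkPoint mentioned above). get_controlfile() and
pg_checksum_page() are the existing frontend helpers; the function and
variable names below are made up for illustration, this is not the
actual patch:

/* Sketch only - use the last checkpoint LSN the way basebackup uses startptr. */
#include "postgres_fe.h"

#include "catalog/pg_control.h"
#include "common/controldata_utils.h"
#include "storage/bufpage.h"
#include "storage/checksum.h"

static XLogRecPtr checkpoint_lsn;	/* last checkpoint LSN from pg_control */

static void
read_checkpoint_lsn(const char *DataDir, const char *progname)
{
	ControlFileData *control;
	bool		crc_ok;

	/* pg_verify_checksums reads pg_control anyway, so this is cheap */
	control = get_controlfile(DataDir, progname, &crc_ok);
	if (!crc_ok)
	{
		fprintf(stderr, "%s: pg_control CRC mismatch\n", progname);
		exit(1);
	}
	checkpoint_lsn = control->checkPoint;
}

/*
 * Same test basebackup.c applies with startptr: only trust the checksum
 * of pages that cannot have been modified after the last checkpoint;
 * anything newer may legitimately look torn while we read it.
 */
static bool
page_is_verifiable(Page page)
{
	return !PageIsNew(page) && PageGetLSN(page) < checkpoint_lsn;
}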
> I'd prefer to come up with a plan where we don't just give up once we
> see a new LSN, if possible. If I run a modified pg_verify_checksums
> which skips on newer pages in a tight benchmark, basically everything
> gets skipped as checkpoints don't happen often enough.

But in that case the checksums are verified when the buffer is read
into shared buffers, so it's not like we don't notice the checksum
error at all. We are interested in the pages that have not been
read/written for an extended period of time. So I think this is not a
problem.

> So how about we do check every page, but if one fails on retry, and
> the LSN is newer than the checkpoint, we then skip it? Is that logic
> sound?

Hmmm, maybe.

> In any case, if we decide we really should skip the page if it is
> newer than the checkpoint, I think it makes sense to track those
> skipped pages and print their number out at the end, if there are
> any.

I agree it might be useful to know how many pages were skipped, and how
many actually passed the checksum check.

>> FWIW I also don't understand the purpose of pg_sleep(), it does not
>> seem to protect against anything, really.
>
> Well, I've noticed that without it I get sporadic checksum failures
> on reread, so I've added it to make them go away. It was certainly a
> phenomenological decision that I am happy to trade for a better one.

My guess is this happened because both the read and the re-read
completed during the same write.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
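For reference, the retry-and-skip behaviour discussed in this exchange
might look roughly like the sketch below: verify every block, re-read
once on a checksum mismatch, and only count the block as skipped
(rather than corrupt) when the second read still fails and the page LSN
is newer than the checkpoint. Here checkpoint_lsn is the value read
from pg_control as in the earlier sketch; fd, the counters and the
function name are hypothetical, not the actual pg_verify_checksums
code, and segment offsets are ignored for brevity.

#include "postgres_fe.h"

#include <unistd.h>

#include "storage/bufpage.h"
#include "storage/checksum.h"

extern XLogRecPtr checkpoint_lsn;	/* from pg_control, see earlier sketch */

static int64 checked_blocks;
static int64 skipped_blocks;	/* newer than checkpoint; print total at the end */
static int64 bad_blocks;

static void
check_block(int fd, BlockNumber blockno)
{
	char		buf[BLCKSZ];
	PageHeader	header = (PageHeader) buf;
	int			tries;

	for (tries = 0; tries < 2; tries++)
	{
		/* (re)read the block from disk on every pass */
		if (lseek(fd, (off_t) blockno * BLCKSZ, SEEK_SET) < 0 ||
			read(fd, buf, BLCKSZ) != BLCKSZ)
		{
			fprintf(stderr, "could not read block %u\n", blockno);
			return;
		}

		/* new (all-zero) pages carry no checksum */
		if (PageIsNew(buf))
			return;

		if (pg_checksum_page(buf, blockno) == header->pd_checksum)
		{
			checked_blocks++;
			return;
		}

		/* mismatch - may be a torn read of a concurrent write, retry once */
		if (tries == 0)
			continue;

		/* second failure: report it only if the page predates the checkpoint */
		if (PageGetLSN(buf) < checkpoint_lsn)
		{
			fprintf(stderr, "checksum failure in block %u\n", blockno);
			bad_blocks++;
		}
		else
			skipped_blocks++;
	}
}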