From: Andres Freund
Subject: Re: Sketch of a fix for that truncation data corruption issue
Date:
Msg-id: 20181211235137.7uqvvn6thlapg75t@alap3.anarazel.de
In response to: Sketch of a fix for that truncation data corruption issue (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
Hi,

On 2018-12-10 15:38:55 -0500, Tom Lane wrote:
> Reflecting on that some more, it seems to me that we're never going to
> get to a solution that everybody finds acceptable without some rather
> significant restructuring at the buffer-access level.

I'm thinking about your proposal RN.

Here's what I previously had in mind for fixing this, on a somewhat
higher level than your proposal. It was angled more towards getting rid
of the AEL, both to allow the truncation to happen with concurrent
writers and to avoid the HS cancellation issues, than towards
robustness. I'm listing it mostly as fodder for thought: since I didn't
aim for robustness, it doesn't quite achieve the same guarantees you
seem to be angling for. But if we're potentially going for a HEAD-only
solution, I think it's reasonable to see whether we can combine ideas.

For truncation (a rough code sketch follows below):

1) Conditionally take the extension lock.

2) Determine the current size.

3) From the back of the relation, probe the buffer manager for each
   page, one by one. If the page is *not* in the buffer manager,
   install a buffer descriptor that's marked invalid, and lock it
   exclusively. If it *is* in buffers, try to conditionally acquire a
   cleanup lock, and go to 4) if that's not available. Also go to 4)
   once more than ~128 pages have been locked.

4) If 3) did not lock any pages, give up.

5) Truncate the FSM/VM.

6) In a critical section: WAL-log the truncation for all the locked
   pages, truncate the pages, and drop the buffers + locks. After this,
   every attempt to read in such a page would fail.

This obviously has the fairly significant drawback that truncations can
only happen in relatively small increments, about a megabyte (limited
by the number of concurrently held lwlocks). But because no locks that
block writes are required, that's much less of an issue than before. I
assume we'd still want an additional heuristic that doesn't even start
trying to truncate if there isn't enough trailing space; we could just
check that without a lock and recheck afterwards.

A second drawback is that, during the truncation, other processes that
try to read one of the empty blocks would have to wait, uninterruptibly
in an lwlock no less, until the truncation is finished.

One issue is that:

> One issue with not holding AEL is that there are race conditions
> wherein readers might attempt to fetch pages beyond the file EOF
> (for example, a seqscan that started before the truncation began
> would attempt to read up to the old EOF, or a very slow indexscan
> might try to follow a pointer from a since-deleted index entry).
> So we would have to change things to regard that as a non-error
> condition. That might be fine, but it does seem like it's giving up
> some error detection capability. If anyone's sufficiently worried
> about that, we could keep the lock level at AEL; but personally
> I think reducing the lock level is worth enough to be willing to make
> that compromise.

becomes a bit more prominent, because it's not just the read access
that now needs to return a non-fatal error when a page after EOF is
read. But that doesn't strike me as a large additional problem.

> After doodling for awhile with that in mind, it seems like we might be
> able to fix it by introducing a new buffer state "truncation pending" that
> we'd apply to empty-but-dirty buffers that we don't want to write out.
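To make steps 1) through 6) above a bit more concrete, here's roughly
the shape of that loop in C. This is just a sketch, not working code:
BufferIsResident(), BufMgrProbeAndLockInvalid(), log_truncate_pages()
and DropAndUnlockBuffer() are made-up names for infrastructure that
doesn't exist today, and I haven't thought about the error paths at
all.

#include "postgres.h"

#include "access/visibilitymap.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/freespace.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
#include "utils/rel.h"

/* cap on concurrently held buffer locks, i.e. ~1MB of heap */
#define MAX_TRUNC_PAGES 128

static void
truncate_tail_sketch(Relation rel)
{
	Buffer		locked[MAX_TRUNC_PAGES];
	int			nlocked = 0;
	int			i;
	BlockNumber nblocks;

	/* 1) conditionally take the extension lock */
	if (!ConditionalLockRelationForExtension(rel, ExclusiveLock))
		return;

	/* 2) determine the current size */
	nblocks = RelationGetNumberOfBlocks(rel);

	/* 3) probe backwards, locking pages until something forces a stop */
	while (nblocks > 0 && nlocked < MAX_TRUNC_PAGES)
	{
		BlockNumber blkno = nblocks - 1;
		Buffer		buf;

		if (BufferIsResident(rel, blkno))	/* made up */
		{
			buf = ReadBuffer(rel, blkno);
			/* only a conditional cleanup lock; stop if unavailable */
			if (!ConditionalLockBufferForCleanup(buf))
			{
				ReleaseBuffer(buf);
				break;
			}
		}
		else
		{
			/* made up: install invalid descriptor, exclusively locked */
			buf = BufMgrProbeAndLockInvalid(rel, blkno);
		}

		locked[nlocked++] = buf;
		nblocks--;
	}

	/* 4) nothing locked: give up */
	if (nlocked == 0)
	{
		UnlockRelationForExtension(rel, ExclusiveLock);
		return;
	}

	/* 5) truncate FSM/VM */
	FreeSpaceMapTruncateRel(rel, nblocks);
	visibilitymap_truncate(rel, nblocks);

	/* 6) WAL-log, truncate, drop buffers + locks, in one critical section */
	START_CRIT_SECTION();
	log_truncate_pages(rel, nblocks);	/* made up */
	RelationOpenSmgr(rel);
	smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
	for (i = 0; i < nlocked; i++)
		DropAndUnlockBuffer(locked[i]); /* made up */
	END_CRIT_SECTION();

	UnlockRelationForExtension(rel, ExclusiveLock);
}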
> Aside from fixing the data corruption issue, this sketch has the
> significant benefit that we don't need to take AccessExclusiveLock
> anymore to do a vacuum truncation: it seems sufficient to lock out
> would-be writers of the table. VACUUM's truncation logic would go
> like this:

It's possible that combining your "truncation pending" flag with my
sketch above would make it possible to avoid the "waiting for lwlock"
issue.

> 7. Issue ftruncate(s), working backwards if the truncation spans multiple
> segment files. Don't error out on syscall failure, just stop truncating
> and note boundary of successful truncation.

ISTM that on the back branches the least we should do is make the
vacuum truncations happen in a critical section. That sucks, but it's
sure better than weird corruption due to the on-disk state diverging
from the in-memory state.

> Another area that's possibly worthy of concern is whether a reader
> could attempt to re-set bits in the FSM or VM for to-be-deleted pages
> after the truncation of those files. We might need some interlock to
> prevent that. Or, perhaps, just re-truncate them after the main
> truncation? Or maybe it doesn't matter if there's bogus data in
> the maps.

It might matter for the VM - if that says "all frozen", we'd be in
trouble later.

Greetings,

Andres Freund