Re: Sketch of a fix for that truncation data corruption issue - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Sketch of a fix for that truncation data corruption issue
Date
Msg-id 20181211235137.7uqvvn6thlapg75t@alap3.anarazel.de
In response to Sketch of a fix for that truncation data corruption issue  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Hi,

On 2018-12-10 15:38:55 -0500, Tom Lane wrote:
> Reflecting on that some more, it seems to me that we're never going to
> get to a solution that everybody finds acceptable without some rather
> significant restructuring at the buffer-access level.

I'm thinking about your proposal right now.  Here's what I previously
had in mind for fixing this, at a somewhat higher level than your
proposal.  It's aimed more at getting rid of the AEL, both to allow the
truncation to happen with concurrent writers and to avoid the HS
cancellation issues, than at robustness.

I'm listing this mostly as food for thought: since I didn't aim for
robustness, it doesn't quite achieve the same guarantees you seem to be
angling for.  But if we're potentially going for a HEAD-only solution,
I think it's reasonable to see whether we can combine ideas.



For truncation:

1) Conditionally take extension lock

2) Determine current size

3) From the back of the relation, one-by-one, probe the buffer manager
   for each page.  If the page is *not* in the buffer manager, install a
   buffer descriptor that's marked invalid, and lock it exclusively.  If
   it *is* in buffers, try to conditionally acquire a cleanup lock; goto
   4) if that's not available.

   If more than ~128 pages have been locked so far, also goto 4).

4) If 3) did not lock any pages, give up.

5) Truncate FSM/VM.

6) In a critical section: WAL-log the truncation for all the locked
   pages, truncate the pages, and drop the buffers + locks.  After this,
   every attempt to read in such a page will fail.
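To make the control flow of steps 2)-4) concrete, here's a toy,
self-contained C model.  The `Page` struct, `try_truncate` and the exact
batch cap are mine, purely for illustration; WAL logging, the
ftruncate() itself and dropping the buffers (step 6) are not modeled:

```c
#include <stdbool.h>

#define MAX_BATCH 128        /* cap: ~number of lwlocks we can hold at once */

/* Toy state for one page of the relation. */
typedef struct
{
    bool resident;           /* page currently sits in the buffer pool */
    bool cleanup_ok;         /* a conditional cleanup lock would succeed */
    bool locked;             /* we locked its (possibly invalid) descriptor */
} Page;

/*
 * Walk backwards from the current size towards the target, locking
 * pages as we go.  Pages not in the buffer pool get an invalid,
 * exclusively locked descriptor; resident pages need a conditional
 * cleanup lock.  Stop at the first failed cleanup lock or after
 * MAX_BATCH pages.  Returns the new relation size; if nothing was
 * locked, that equals old_size, i.e. we give up (step 4).
 */
static int
try_truncate(Page *rel, int old_size, int want_size)
{
    int new_size = old_size;

    while (new_size > want_size && old_size - new_size < MAX_BATCH)
    {
        Page *p = &rel[new_size - 1];

        if (p->resident && !p->cleanup_ok)
            break;              /* conditional cleanup lock failed */
        p->locked = true;
        new_size--;
    }
    return new_size;
}
```

A resident page whose cleanup lock can't be taken stops the walk, so
everything in front of it survives; a fully cold tail truncates until
the lock cap is hit.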


This obviously has the fairly significant drawback that truncations can
only happen in relatively small increments of about a megabyte (limited
by the number of concurrently held lwlocks).  But because no locks that
block writes are required, that's much less of an issue than previously.
I assume we'd still want an additional heuristic that doesn't even start
trying to truncate unless there's enough trailing space - but we could
check that without a lock and recheck afterwards.
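For scale, the "about a megabyte" figure follows from PostgreSQL's
default 8 kB block size and the rough ~128-page lock cap above (the cap
is my illustrative number, not a settled constant); larger amounts of
trailing space would need repeated passes:

```c
#define BLCKSZ    8192   /* PostgreSQL's default block size in bytes */
#define MAX_BATCH  128   /* rough cap on pages locked per truncation pass */

/* One pass can reclaim at most MAX_BATCH * BLCKSZ bytes:
 * 128 * 8192 = 1048576 bytes, i.e. 1 MiB of trailing space. */
static long
batch_bytes(void)
{
    return (long) MAX_BATCH * BLCKSZ;
}
```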

A second drawback is that during the truncation, other processes that
try to read one of the emptied blocks would need to wait, uninterruptibly
on an lwlock no less, until the truncation is finished.

One issue is that:
> One issue with not holding AEL is that there are race conditions
> wherein readers might attempt to fetch pages beyond the file EOF
> (for example, a seqscan that started before the truncation began
> would attempt to read up to the old EOF, or a very slow indexscan
> might try to follow a pointer from a since-deleted index entry).
> So we would have to change things to regard that as a non-error
> condition.  That might be fine, but it does seem like it's giving up
> some error detection capability.  If anyone's sufficiently worried
> about that, we could keep the lock level at AEL; but personally
> I think reducing the lock level is worth enough to be willing to make
> that compromise.

becomes a bit more of a prominent problem, because more access paths now
need to treat a page-after-EOF read as a non-fatal error.  But that
doesn't strike me as a large additional problem.
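As a toy sketch of that non-error read path (the enum and function names
are hypothetical - the real interface would be some variant of the
buffer-manager read routines, not this):

```c
/* Hypothetical result of a toy read path. */
typedef enum
{
    READ_OK,                 /* block exists, return its page */
    READ_BEYOND_EOF          /* block vanished in a concurrent truncation */
} ReadResult;

/*
 * Without an AEL, a seqscan that started before the truncation, or a
 * slow indexscan following a stale TID, may ask for a block behind the
 * new EOF.  Report that as a distinct, non-fatal result instead of
 * raising an error, so callers can treat the pointer as dead.
 */
static ReadResult
read_block(int rel_nblocks, int blockno)
{
    if (blockno >= rel_nblocks)
        return READ_BEYOND_EOF;
    return READ_OK;
}
```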


> After doodling for awhile with that in mind, it seems like we might be
> able to fix it by introducing a new buffer state "truncation pending" that
> we'd apply to empty-but-dirty buffers that we don't want to write out.
> Aside from fixing the data corruption issue, this sketch has the
> significant benefit that we don't need to take AccessExclusiveLock
> anymore to do a vacuum truncation: it seems sufficient to lock out
> would-be writers of the table.  VACUUM's truncation logic would go
> like this:

It's possible that combining your "truncation pending" flag with my
sketch above would allow us to avoid the "waiting for an lwlock" issue.


> 7. Issue ftruncate(s), working backwards if the truncation spans multiple
> segment files.  Don't error out on syscall failure, just stop truncating
> and note boundary of successful truncation.

ISTM on the back-branches the least we should do is to make the
vacuum truncations happen in a critical section.  That sucks, but it's
sure better than weird corruption due to the on-disk state diverging
from the in-memory state.


> Another area that's possibly worthy of concern is whether a reader
> could attempt to re-set bits in the FSM or VM for to-be-deleted pages
> after the truncation of those files.  We might need some interlock to
> prevent that.  Or, perhaps, just re-truncate them after the main
> truncation?  Or maybe it doesn't matter if there's bogus data in
> the maps.

It might matter for the VM - if that says "all frozen", we'd be in
trouble later.
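A toy illustration of why a stale VM bit is the dangerous case (names
are mine): vacuum skips pages the VM calls all-frozen, so if the VM
isn't truncated along with the heap and the relation is later
re-extended, a leftover bit can cover a page that still holds unfrozen
tuples:

```c
#include <stdbool.h>

/* Toy pairing of a heap page with its visibility-map bit. */
typedef struct
{
    bool vm_all_frozen;      /* what the (possibly stale) VM claims */
    bool has_unfrozen;       /* what the heap page actually contains */
} HeapPageState;

/*
 * The hazard: vacuum never revisits a page the VM marks all-frozen, so
 * a stale bit over a page with unfrozen tuples means those tuples are
 * never frozen - a wraparound problem waiting to happen.
 */
static bool
wraparound_hazard(HeapPageState p)
{
    return p.vm_all_frozen && p.has_unfrozen;
}
```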

Greetings,

Andres Freund

