Re: Worries about delayed-commit semantics - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: Worries about delayed-commit semantics
Msg-id 1182502188.9276.106.camel@silverbirch.site
In response to Worries about delayed-commit semantics  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Thu, 2007-06-21 at 18:15 -0400, Tom Lane wrote:
> I've been reflecting a bit about whether the notion of deferred fsync
> for transaction commits is really safe.  The proposed patch tries to
> ensure that no consequences of a committed transaction can reach disk
> before the commit WAL record is fsync'd, but ISTM there are potential
> holes in what it's doing.  In particular the path that concerns me is
> 
> (1) transaction A commits with deferred fsync;
> 
> (2) transaction B observes some effect of A (eg, a committed-good tuple);
> 
> (3) transaction B makes a change that is contingent on the observation.
> 
> If B's changes were to reach disk in advance of A's commit record, we'd
> have a risk of logical inconsistency.  

B's changes cannot reach disk before the WAL records that describe
them. That is the existing WAL-before-data rule implemented by the
buffer manager.

If B can see A's changes, then A's commit record is already in the log,
ahead of B's commit record. WAL is flushed sequentially, so when B
flushes at end of transaction (EOX) it also makes A's commit record
durable. So whether A is a guaranteed transaction or not, B can always
rely on those changes.
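
The ordering argument can be sketched with a toy model. This is only an
illustration, not PostgreSQL code: ToyWAL, ToyBufferManager, and the
record strings are hypothetical names invented here. It assumes a WAL
flush always covers a prefix of the log, and that a page is written only
after the log is flushed up to that page's LSN.

```python
class ToyWAL:
    """Append-only log. Flushing up to LSN n makes every record <= n durable."""

    def __init__(self):
        self.records = []       # in-memory log; list index + 1 is the LSN
        self.flushed_upto = 0   # highest LSN known to be durable

    def append(self, record):
        self.records.append(record)
        return len(self.records)  # LSN of the record just appended

    def flush(self, lsn):
        # Prefix flush: record n cannot become durable without its predecessors.
        self.flushed_upto = max(self.flushed_upto, lsn)

    def durable(self):
        return self.records[:self.flushed_upto]


class ToyBufferManager:
    """Hypothetical buffer manager that enforces WAL-before-data."""

    def __init__(self, wal):
        self.wal = wal
        self.page_lsn = {}    # page id -> LSN of the last record that touched it
        self.on_disk = set()  # pages that have been written out

    def modify(self, page, record):
        self.page_lsn[page] = self.wal.append(record)

    def write_page(self, page):
        # Flush the log up to the page's LSN before the page itself is written.
        self.wal.flush(self.page_lsn[page])
        self.on_disk.add(page)


def demo():
    wal = ToyWAL()
    bufmgr = ToyBufferManager(wal)

    bufmgr.modify("page_a", "A: update")
    a_commit_lsn = wal.append("A: commit")  # deferred fsync: appended, not flushed
    assert wal.flushed_upto < a_commit_lsn

    # B has seen A's effect, so B's record lands after A's commit in the log.
    bufmgr.modify("page_b", "B: update contingent on A")
    bufmgr.write_page("page_b")             # forces a flush past A's commit
    return wal, a_commit_lsn
```

In this model, writing B's page forces the log to be flushed past A's
commit record, so B's change is never on disk without it.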

I agree this feels unsafe when you first think about it, which is why I
took months before publishing the idea.

> The patch is doing what it can
> to prevent *direct* effects of A from reaching disk before the commit
> record does, but it doesn't (and I think cannot) extend this to indirect
> effects perpetrated by other transactions.  An example of the sort of
> risk I'm worried about is a REINDEX omitting an index entry for a tuple
> that it sees as committed dead by A.
> 
> Now this may be safe anyway, but it requires analysis that I don't
> recall anyone having put forward.  The cases that I can see are:
> 
> 1. Ordinary WAL-logged change in a shared buffer page.  The change will
> not be allowed to reach disk before the associated WAL record does, and
> that WAL record must follow A's commit, so we're safe.
> 
> 2. Non-WAL-logged change in a temp table.  Could reach disk in advance
> of A's commit, but we don't care since temp table contents don't survive
> crashes anyway.
> 
> 3. Non-WAL-logged change made via one of the paths we have introduced
> to avoid WAL overhead for bulk updates.  In these cases it's entirely
> possible for the data to reach disk before A's commit, because B will
> fsync it down to disk without any sort of interlock, as soon as it
> finishes the bulk update.  However, I believe it's the case that all
> these paths are designed to write data that no other transaction can see
> until after B commits.  That commit must follow A's in the WAL log,
> so until it has reached disk, the contents of the bulk-updated file
> are unimportant after a crash.
> 
> So I think it's probably all OK, but this is a sufficiently long chain
> of reasoning that it had better be checked over by multiple people and
> recorded as part of the design implications of the patch.  Does anyone
> think any of this is wrong, or too fragile to survive future code
> changes?  Are there cases I've missed?

I've done the analysis, but perhaps I should finish the docs now to aid
with review of the patch on the points you make.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com



