Worries about delayed-commit semantics - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Worries about delayed-commit semantics |
Date | |
Msg-id | 7862.1182464151@sss.pgh.pa.us |
Responses | Re: Worries about delayed-commit semantics |
List | pgsql-hackers |
I've been reflecting a bit about whether the notion of deferred fsync for transaction commits is really safe. The proposed patch tries to ensure that no consequences of a committed transaction can reach disk before the commit WAL record is fsync'd, but ISTM there are potential holes in what it's doing. In particular the path that concerns me is (1) transaction A commits with deferred fsync; (2) transaction B observes some effect of A (eg, a committed-good tuple); (3) transaction B makes a change that is contingent on the observation. If B's changes were to reach disk in advance of A's commit record, we'd have a risk of logical inconsistency.

The patch is doing what it can to prevent *direct* effects of A from reaching disk before the commit record does, but it doesn't (and I think cannot) extend this to indirect effects perpetrated by other transactions. An example of the sort of risk I'm worried about is a REINDEX omitting an index entry for a tuple that it sees as committed dead by A.

Now this may be safe anyway, but it requires analysis that I don't recall anyone having put forward. The cases that I can see are:

1. Ordinary WAL-logged change in a shared buffer page. The change will not be allowed to reach disk before the associated WAL record does, and that WAL record must follow A's commit, so we're safe.

2. Non-WAL-logged change in a temp table. Could reach disk in advance of A's commit, but we don't care since temp table contents don't survive crashes anyway.

3. Non-WAL-logged change made via one of the paths we have introduced to avoid WAL overhead for bulk updates. In these cases it's entirely possible for the data to reach disk before A's commit, because B will fsync it down to disk without any sort of interlock, as soon as it finishes the bulk update. However, I believe it's the case that all these paths are designed to write data that no other transaction can see until after B commits. That commit must follow A's in the WAL log, so until it has reached disk, the contents of the bulk-updated file are unimportant after a crash.

So I think it's probably all OK, but this is a sufficiently long chain of reasoning that it had better be checked over by multiple people and recorded as part of the design implications of the patch. Does anyone think any of this is wrong, or too fragile to survive future code changes? Are there cases I've missed?

BTW: I really dislike the name "transaction guarantee" for the feature; it sounds like marketing-speak, not to mention overpromising what we can deliver. Postgres can't "guarantee" anything in the face of untrustworthy disk hardware, for instance. I'd much rather use names derived from "deferred commit" or "delayed commit" or some such.

regards, tom lane
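
The argument for case 1 above can be illustrated with a toy model. This is only a sketch and not PostgreSQL code: the names `ToyWAL`, `ToyPage`, and `write_page` are hypothetical, and the model assumes that the log is flushed strictly in order and that a data page may be written only after the log has been flushed up to that page's LSN. Under those assumptions, writing B's page forces A's earlier, still-unflushed commit record to disk first.

```python
from dataclasses import dataclass, field


@dataclass
class ToyWAL:
    """Hypothetical append-only log; records past flushed_upto are lost in a crash."""
    records: list = field(default_factory=list)
    flushed_upto: int = 0  # highest LSN assumed to be safely on disk

    def append(self, record: str) -> int:
        self.records.append(record)
        return len(self.records)  # LSNs are 1-based positions in the log

    def flush(self, upto: int) -> None:
        # Assumption: flushing is sequential, so flushing up to LSN n
        # also makes every earlier record durable.
        self.flushed_upto = max(self.flushed_upto, upto)

    def durable_records(self) -> list:
        return self.records[: self.flushed_upto]


@dataclass
class ToyPage:
    lsn: int = 0           # LSN of the last WAL record that touched this page
    on_disk: bool = False


def write_page(wal: ToyWAL, page: ToyPage) -> None:
    # WAL-before-data rule: flush the log up to the page's LSN, then write the page.
    wal.flush(page.lsn)
    page.on_disk = True


wal = ToyWAL()

# (1) Transaction A commits with deferred fsync: record appended, not flushed.
commit_a = wal.append("COMMIT A")
assert wal.flushed_upto < commit_a

# (2)+(3) Transaction B sees A's effect and makes a WAL-logged change.
page = ToyPage()
page.lsn = wal.append("B: update tuple")

# B's page is about to reach disk; the rule drags A's commit record along.
write_page(wal, page)
a_commit_durable = "COMMIT A" in wal.durable_records()
print(a_commit_durable)
```

In this model B's change cannot be on disk unless A's commit record is too, which is the ordering case 1 relies on. Cases 2 and 3 are not covered by the sketch, since they involve writes that bypass the log.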