Thread: We have got a serious problem with pg_clog/WAL synchronization
While investigating Satoshi Okada's bug report here http://archives.postgresql.org/pgsql-hackers/2004-08/msg00510.php I realized that it actually represents a crash-safety risk that has existed since 7.2. <lecture> Allow me to refresh your memory about the principles of write-ahead logging. The one that everyone remembers is "a WAL entry must hit disk before any of the data changes it describes". But there is a different constraint that must also be met, which is "a checkpoint must flush all data changes of preceding WAL entries". In detail, a checkpoint does: 1. Note the current end-of-WAL position (where the next WAL entry will be made). This is the checkpoint's "REDO" pointer. 2. Flush all dirty buffers to disk, and fsync all changes. 3. Add a checkpoint record to WAL, and flush it to disk. (Note that when there is concurrent activity, other WAL records may be added to WAL between steps 1 and 3, so that the checkpoint record's physical location is later than its REDO pointer. This is okay. The added records are logically after the checkpoint, even though physically located before it in the WAL data.) If we now suffer a crash, log replay will be executed starting at the REDO point. Since records before the REDO point will not be replayed, it is critical that the "flush" operations in step 2 have written all the effects of such records to disk. </lecture> Satoshi-san's bug report shows a way to cause the system to sometimes violate this constraint. In particular, what I saw was a transaction commit WAL record that was just before the REDO pointer of a checkpoint, but the pg_clog status update for it had not been flushed to disk by the checkpoint. The reason this is possible is that RecordTransactionCommit first writes the commit record, and fsyncs it, and only then goes and makes the pg_clog status update in shared memory. There is thus a window for a checkpoint to start, note its REDO point (after the commit), and flush the current contents of pg_clog buffers out to disk before the transaction has updated its state in pg_clog. This has been broken since the original design of pg_clog in 7.2 :-(. I fear I have to take the blame for it. (Just to add insult to injury: if you enable commit_delay then the sleep occurs during the window of vulnerability.) What I am thinking of doing to fix the problem is to introduce a new LWLock that RecordTransactionCommit will take a shared lock on before writing its WAL record, and not release until it has updated pg_clog. Checkpoint start will acquire the lock exclusively just long enough to do its step 1 (establish REDO point). This is slightly annoying since it means one more high-traffic lock to grab during commit, but I don't see any other good solution. We will certainly have to back-patch this into 7.4 and I suppose we should think about issuing new 7.3 and 7.2 releases as well. regards, tom lane
Tom Lane wrote: >While investigating Satoshi Okada's bug report here > > >... > >What I am thinking of doing to fix the problem is to introduce >a new LWLock that RecordTransactionCommit will take a shared lock on >before writing its WAL record, and not release until it has updated >pg_clog. Checkpoint start will acquire the lock exclusively just long >enough to do its step 1 (establish REDO point). This is slightly >annoying since it means one more high-traffic lock to grab during >commit, but I don't see any other good solution. > Please forgive me to give my potentially irrelevant comments. I am not too familiar with the internals of the postgresql. It seems to me this is an interesting phenomena of interactions between frequent events of transaction commits and infrequent events of system checkpoints. A potential alternative solution to adding a new shared lock to the frequent commit operation is to let the infrequent checkpoint operation take more overhead. I suppose acquiring/releasing an extra lock for each commit would incur extra performance overhead, even when the lock is not contented. On the other hand, let the checkpoing operation acquire some existing locks (exclusively) to effectively disallowing committing transactions to interfere with the checkpoint process might be a better solution since it incur higher overhead only when necessary. Of course, this after all might be "pre-mature" optimization. Just my $0.02. Thanks, -Min -- Rong: Life is all about running after a busy agenda! How frustrating! Min: That's right! You can either deal with things as quick as possible or to create less items on the agenda to begin with.
"Min Xu (Hsu)" <xu@cs.wisc.edu> writes: > It seems to me this is an interesting phenomena of interactions between > frequent events of transaction commits and infrequent events of system > checkpoints. A potential alternative solution to adding a new shared > lock to the frequent commit operation is to let the infrequent > checkpoint operation take more overhead. I suppose acquiring/releasing > an extra lock for each commit would incur extra performance overhead, > even when the lock is not contented. On the other hand, let the > checkpoing operation acquire some existing locks (exclusively) to > effectively disallowing committing transactions to interfere with the > checkpoint process might be a better solution since it incur higher > overhead only when necessary. Unfortunately, there isn't any pre-existing lock that will serve. A transaction that is between XLogInsert'ing its COMMIT record and updating the shared pg_clog data area does not hold any lock that could be used to prevent a checkpoint from starting. (Or it didn't until yesterday's patch, anyway.) I looked briefly at reorganizing the existing code so that we'd do the COMMIT XLogInsert while we're holding lock on the shared pg_clog data, which would solve the problem without adding any new lock acquisition. But this seemed extremely messy to do. Also it would be optimizing transaction commit at the cost of pessimizing other uses of pg_clog, which might have to wait longer to get at the shared data. Adding the new lock has the advantage that we can be sure it's not blocking anything we don't want it to block. Thanks for thinking about the problem though ... regards, tom lane
Tom Lane wrote: > >Unfortunately, there isn't any pre-existing lock that will serve. >A transaction that is between XLogInsert'ing its COMMIT record and >updating the shared pg_clog data area does not hold any lock that >could be used to prevent a checkpoint from starting. (Or it didn't >until yesterday's patch, anyway.) > >I looked briefly at reorganizing the existing code so that we'd do the >COMMIT XLogInsert while we're holding lock on the shared pg_clog data, >which would solve the problem without adding any new lock acquisition. >But this seemed extremely messy to do. Also it would be optimizing >transaction commit at the cost of pessimizing other uses of pg_clog, >which might have to wait longer to get at the shared data. Adding the >new lock has the advantage that we can be sure it's not blocking >anything we don't want it to block. > >Thanks for thinking about the problem though ... > > You are welcome.