data corruption hazard in reorderbuffer.c - Mailing list pgsql-hackers
From | Mark Dilger |
---|---|
Subject | data corruption hazard in reorderbuffer.c |
Date | |
Msg-id | 4CBA7034-3E44-48D1-9F73-20080B76E3C2@enterprisedb.com Whole thread Raw |
Responses |
Re: data corruption hazard in reorderbuffer.c
|
List | pgsql-hackers |
Hackers, I believe there is a hazard in reorderbuffer.c if a call to write() buffers data rather than flushing it to disk, only tofail when flushing the data during close(). The relevant code is in ReorderBufferSerializeTXN(), which iterates over changesfor a transaction, opening transient files as needed for the changes to be written, writing the transaction changesto the transient files using ReorderBufferSerializeChange(), and closing the files when finished using CloseTransientFile(),the return code from which is ignored. If ReorderBufferSerializeChange() were fsync()ing the writes,then ignoring the failures on close() would likely be ok, or at least in line with what we do elsewhere. But as faras I can see, no call to sync the file is performed. I expect a failure here could result in a partially written change in the transient file, perhaps preceded or followed byadditional complete or partial changes. Perhaps even worse, a failure could result in a change not being written at all(rather than partially), potentially with preceding and following changes written intact, with no indication that onechange is missing. Upon testing, both of these expectations appear to be true. Skipping some writes while not others easily creates a varietyof failures, and for brevity I won't post a patch to demonstrate that here. The following code change causes wholerather than partial changes to be written or skipped. After applying this change and running the tests in contrib/test_decoding/,"test toast" shows that an UPDATE command has been silently skipped. diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c index 7378beb684..a6c60b81c9 100644 --- a/src/backend/replication/logical/reorderbuffer.c +++ b/src/backend/replication/logical/reorderbuffer.c @@ -108,6 +108,7 @@ #include "utils/rel.h" #include "utils/relfilenodemap.h" +static bool lose_writes_until_close = false; /* entry for a hash table we use to map from xid to our transaction state */ typedef struct ReorderBufferTXNByIdEnt @@ -3523,6 +3524,8 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) Size spilled = 0; Size size = txn->size; + lose_writes_until_close = false; + elog(DEBUG2, "spill %u changes in XID %u to disk", (uint32) txn->nentries_mem, txn->xid); @@ -3552,7 +3555,10 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) char path[MAXPGPATH]; if (fd != -1) + { CloseTransientFile(fd); + lose_writes_until_close = !lose_writes_until_close; + } XLByteToSeg(change->lsn, curOpenSegNo, wal_segment_size); @@ -3600,6 +3606,8 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) if (fd != -1) CloseTransientFile(fd); + + lose_writes_until_close = false; } /* @@ -3790,7 +3798,10 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn, errno = 0; pgstat_report_wait_start(WAIT_EVENT_REORDER_BUFFER_WRITE); - if (write(fd, rb->outbuf, ondisk->size) != ondisk->size) + + if (lose_writes_until_close) + ; /* silently do nothing with buffer contents */ + else if (write(fd, rb->outbuf, ondisk->size) != ondisk->size) { int save_errno = errno; The attached patch catches the failures on close(), but to really fix this properly, we should call pg_fsync() before close(). Any thoughts on this? It seems to add a fair amount of filesystem burden to add all the extra fsync activity, though I admitto having not benchmarked that yet. Perhaps doing something like Thomas's work for commit dee663f7843 where closingfiles is delayed so that fewer syncs are needed? I'm not sure how much that would help here, and would like feedbackbefore pursuing anything of that sort. The relevant code exists back as far as the 9_4_STABLE branch, where commit b89e151054a from 2014 first appears. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
pgsql-hackers by date: