data corruption hazard in reorderbuffer.c - Mailing list pgsql-hackers

From Mark Dilger
Subject data corruption hazard in reorderbuffer.c
Date
Msg-id 4CBA7034-3E44-48D1-9F73-20080B76E3C2@enterprisedb.com
Whole thread Raw
Responses Re: data corruption hazard in reorderbuffer.c
List pgsql-hackers
Hackers,

I believe there is a hazard in reorderbuffer.c if a call to write() buffers data rather than flushing it to disk, only
tofail when flushing the data during close().  The relevant code is in ReorderBufferSerializeTXN(), which iterates over
changesfor a transaction, opening transient files as needed for the changes to be written, writing the transaction
changesto the transient files using ReorderBufferSerializeChange(), and closing the files when finished using
CloseTransientFile(),the return code from which is ignored.  If ReorderBufferSerializeChange() were fsync()ing the
writes,then ignoring the failures on close() would likely be ok, or at least in line with what we do elsewhere.  But as
faras I can see, no call to sync the file is performed. 

I expect a failure here could result in a partially written change in the transient file, perhaps preceded or followed
byadditional complete or partial changes.  Perhaps even worse, a failure could result in a change not being written at
all(rather than partially), potentially with preceding and following changes written intact, with no indication that
onechange is missing. 

Upon testing, both of these expectations appear to be true.  Skipping some writes while not others easily creates a
varietyof failures, and for brevity I won't post a patch to demonstrate that here.  The following code change causes
wholerather than partial changes to be written or skipped.  After applying this change and running the tests in
contrib/test_decoding/,"test toast" shows that an UPDATE command has been silently skipped. 

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 7378beb684..a6c60b81c9 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -108,6 +108,7 @@
 #include "utils/rel.h"
 #include "utils/relfilenodemap.h"

+static bool lose_writes_until_close = false;

 /* entry for a hash table we use to map from xid to our transaction state */
 typedef struct ReorderBufferTXNByIdEnt
@@ -3523,6 +3524,8 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
        Size            spilled = 0;
        Size            size = txn->size;

+       lose_writes_until_close = false;
+
        elog(DEBUG2, "spill %u changes in XID %u to disk",
                 (uint32) txn->nentries_mem, txn->xid);

@@ -3552,7 +3555,10 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
                        char            path[MAXPGPATH];

                        if (fd != -1)
+                       {
                                CloseTransientFile(fd);
+                               lose_writes_until_close = !lose_writes_until_close;
+                       }

                        XLByteToSeg(change->lsn, curOpenSegNo, wal_segment_size);

@@ -3600,6 +3606,8 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)

        if (fd != -1)
                CloseTransientFile(fd);
+
+       lose_writes_until_close = false;
 }

 /*
@@ -3790,7 +3798,10 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,

        errno = 0;
        pgstat_report_wait_start(WAIT_EVENT_REORDER_BUFFER_WRITE);
-       if (write(fd, rb->outbuf, ondisk->size) != ondisk->size)
+
+       if (lose_writes_until_close)
+               ;       /* silently do nothing with buffer contents */
+       else if (write(fd, rb->outbuf, ondisk->size) != ondisk->size)
        {
                int                     save_errno = errno;


The attached patch catches the failures on close(), but to really fix this properly, we should call pg_fsync() before
close().

Any thoughts on this?  It seems to add a fair amount of filesystem burden to add all the extra fsync activity, though I
admitto having not benchmarked that yet.  Perhaps doing something like Thomas's work for commit dee663f7843 where
closingfiles is delayed so that fewer syncs are needed?  I'm not sure how much that would help here, and would like
feedbackbefore pursuing anything of that sort. 

The relevant code exists back as far as the 9_4_STABLE branch, where commit b89e151054a from 2014 first appears.



—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Attachment

pgsql-hackers by date:

Previous
From: Daniel Gustafsson
Date:
Subject: Re: Avoid repeated PQfnumber() in pg_dump
Next
From: "David G. Johnston"
Date:
Subject: Re: pg_upgrade does not upgrade pg_stat_statements properly