Re: data corruption hazard in reorderbuffer.c - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: data corruption hazard in reorderbuffer.c
Msg-id: 7d1ccf60-8a86-76b2-2831-d2b8a4554007@enterprisedb.com
In response to: Re: data corruption hazard in reorderbuffer.c  (Mark Dilger <mark.dilger@enterprisedb.com>)
List: pgsql-hackers

On 7/16/21 1:07 AM, Mark Dilger wrote:
> 
> 
>> On Jul 15, 2021, at 3:32 PM, Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>> 
>> I think it's mostly futile to list all the possible issues this
>> might have caused - if you skip arbitrary decoded changes, that can
>> trigger pretty much any bug in reorder buffer. But those bugs can
>> be triggered by various other issues, of course.
> 
> I thought that at first, too, but when I searched for bug reports
> with the given error messages, they all had to do with logical
> replication or logical decoding.  That seems a bit fishy to me.  If
> these bugs were coming from all over the system, why would that be
> so?
> 

No, I'm not suggesting those bugs are coming from all over the system. 
The theory is that there's a bug in logical decoding / reorder buffer, 
with the same symptoms. So it'd affect systems with logical replication.

>> It's hard to say what was the cause, but the "logic" bugs are
>> probably permanent, while the issues triggered by I/O probably
>> disappear after a restart?
> 
> If you mean "logic" bugs like passing an invalid file descriptor to
> close(), then yes, those would be permanent, but I don't believe we
> have any bugs of that kind.
> 

No, by logic bugs I mean cases like failing to update the snapshot or 
losing track of relfilenode changes because some case was not considered, 
as opposed to "I/O errors" caused by external events.

> The consequences of I/O errors are not going to go away after
> restart.  On the subscriber side, if logical replication has replayed
> less than a full transaction worth of changes without raising an
> error, the corruption will be permanent.
> 

True, good point. I was thinking about the "could not map relfilenode" 
error, which forces the decoding to restart; on the retry it's unlikely 
to hit the same I/O error, so it'll succeed. But you're right that if 
the corruption reaches the subscriber and gets committed (e.g. a missing 
row), it's permanent.

>> That being said, I agree this seems like an issue and we should not
>> ignore I/O errors.
> 
> I agree.
> 
>> I'd bet other places using transient files (like sorting or hashagg
>> spilling to disk) have the same issue, although in that case the
>> impact is likely limited to a single query.
> 
> I'm not so convinced.  In most places, the result is ignored for
> close()-type operations only when the prior operation failed and
> we're closing the handle in preparation for raising an error.  There
> are a small number of other places, such as
> pg_import_system_collations and perform_base_backup, which ignore the
> result from closing a handle, but those are reading from the handle,
> not writing to it, so the situation is not comparable.
> 
> I believe the oversight in reorderbuffer.c really is a special case.
> 

Hmm, you might be right. I only looked briefly at buffile.c and 
tuplestore.c yesterday, and I wasn't sure about the error checking. But 
now that I look at it, maybe that works fine, because we don't really 
use the files after close().

>> I wonder if sync before the close is an appropriate solution,
>> though. It seems rather expensive, and those files are meant to be
>> "temporary" (i.e. we don't keep them over restart). So maybe we
>> could ensure the consistency is a cheaper way - perhaps tracking
>> some sort of checksum for each file, or something like that?
> 
> I'm open to suggestions of what to do, but thinking of these files as
> temporary may be what misleads developers to believe they don't have
> to be treated carefully.  The code was first committed in 2014 and as
> far as I am aware nobody else has complained about this before.  In
> some part that might be because CloseTemporaryFile() is less familiar
> than close() to most developers, and they may be assuming that it
> contains its own error handling and just don't realize that it
> returns an error code just like regular close() does.
> 

I don't think anyone thinks corruption of temporary files would not be 
an issue, but making them fully persistent (which is what fsync would 
do) just to make sure no piece of them is lost seems like overkill to me. 
And the file may get corrupted even with fsync(), so I'm wondering if 
there's a way to ensure integrity without the fsync (i.e. more cheaply).

The scheme I was thinking about is a simple checksum for each BufFile. 
When writing a buffer to disk, update the crc32c checksum, and then 
verify it when reading the file back (and error out if it disagrees).
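
To make that a bit more concrete, here's a rough stand-alone sketch of 
the idea (nothing from the actual BufFile code; zlib's crc32() merely 
stands in for the crc32c routines we'd use in the backend, and all the 
names are invented):

#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

typedef struct SpillFile
{
    FILE   *fp;
    uLong   write_crc;      /* accumulated over everything written */
} SpillFile;

static void
spill_write(SpillFile *sf, const void *buf, size_t len)
{
    if (fwrite(buf, 1, len, sf->fp) != len)
    {
        perror("fwrite");
        exit(1);
    }
    /* update the running checksum as each buffer goes to disk */
    sf->write_crc = crc32(sf->write_crc, (const Bytef *) buf, (uInt) len);
}

static void
spill_verify(SpillFile *sf)
{
    unsigned char buf[8192];
    size_t  n;
    uLong   read_crc = crc32(0L, Z_NULL, 0);

    rewind(sf->fp);
    while ((n = fread(buf, 1, sizeof(buf), sf->fp)) > 0)
        read_crc = crc32(read_crc, buf, (uInt) n);

    /* error out if what we read back is not what we wrote */
    if (read_crc != sf->write_crc)
    {
        fprintf(stderr, "spill file checksum mismatch\n");
        exit(1);
    }
}

int
main(void)
{
    SpillFile   sf;

    sf.fp = tmpfile();
    if (sf.fp == NULL)
    {
        perror("tmpfile");
        exit(1);
    }
    sf.write_crc = crc32(0L, Z_NULL, 0);

    spill_write(&sf, "change #1", 9);
    spill_write(&sf, "change #2", 9);

    fflush(sf.fp);
    spill_verify(&sf);      /* passes silently unless the data differs */

    fclose(sf.fp);
    return 0;
}

That way the file stays temporary (no fsync), but data that never made it 
to the file would at least be detected before the changes are replayed.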

> The point of the reorder buffer is to sort the changes from a single
> transaction so that they can be replayed in order.  The files used
> for the sorting are temporary, but there is nothing temporary about
> the failure to include all the changes in the files, as that will
> have permanent consequences for whoever replays them.  If lucky, the
> attempt to replay the changes will abort because they don't make
> sense, but I've demonstrated to my own satisfaction that nothing
> prevents silent data loss if the failure to write changes happens to
> destroy a complete rather than partial change.
> 

Sure, I agree with all of that.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


