Re: data corruption hazard in reorderbuffer.c - Mailing list pgsql-hackers

From Mark Dilger
Subject Re: data corruption hazard in reorderbuffer.c
Date
Msg-id D9CF71D1-89DB-401C-8558-D814937B70EA@enterprisedb.com
Whole thread Raw
In response to Re: data corruption hazard in reorderbuffer.c  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: data corruption hazard in reorderbuffer.c  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List pgsql-hackers

> On Jul 15, 2021, at 3:32 PM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> I think it's mostly futile to list all the possible issues this might have caused - if you skip arbitrary decoded
changes,that can trigger pretty much any bug in reorder buffer. But those bugs can be triggered by various other
issues,of course. 

I thought that at first, too, but when I searched for bug reports with the given error messages, they all had to do
withlogical replication or logical decoding.  That seems a bit fishy to me.  If these bugs were coming from all over
thesystem, why would that be so? 

> It's hard to say what was the cause, but the "logic" bugs are probably permanent, while the issues triggered by I/O
probablydisappear after a restart? 

If you mean "logic" bugs like passing an invalid file descriptor to close(), then yes, those would be permanent, but I
don'tbelieve we have any bugs of that kind. 

The consequences of I/O errors are not going to go away after restart.  On the subscriber side, if logical replication
hasreplayed less than a full transaction worth of changes without raising an error, the corruption will be permanent. 

> That being said, I agree this seems like an issue and we should not ignore I/O errors.

I agree.

> I'd bet other places using transient files (like sorting or hashagg spilling to disk) has the same issue, although in
thatcase the impact is likely limited to a single query. 

I'm not so convinced.  In most places, the result is ignored for close()-type operations only when the prior operation
failedand we're closing the handle in preparation for raising an error.  There are a small number of other places, such
aspg_import_system_collations and perform_base_backup, which ignore the result from closing a handle, but those are
readingfrom the handle, not writing to it, so the situation is not comparable. 

I believe the oversight in reorderbuffer.c really is a special case.

> I wonder if sync before the close is an appropriate solution, though. It seems rather expensive, and those files are
meantto be "temporary" (i.e. we don't keep them over restart). So maybe we could ensure the consistency is a cheaper
way- perhaps tracking some sort of checksum for each file, or something like that? 

I'm open to suggestions of what to do, but thinking of these files as temporary may be what misleads developers to
believethey don't have to be treated carefully.  The code was first committed in 2014 and as far as I am aware nobody
elsehas complained about this before.  In some part that might be because CloseTemporaryFile() is less familiar than
close()to most developers, and they may be assuming that it contains its own error handling and just don't realize that
itreturns an error code just like regular close() does. 

The point of the reorder buffer is to sort the changes from a single transaction so that they can be replayed in order.
The files used for the sorting are temporary, but there is nothing temporary about the failure to include all the
changesin the files, as that will have permanent consequences for whoever replays them.  If lucky, the attempt to
replaythe changes will abort because they don't make sense, but I've demonstrated to my own satisfaction that nothing
preventssilent data loss if the failure to write changes happens to destroy a complete rather than partial change. 

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: data corruption hazard in reorderbuffer.c
Next
From: Jan Wieck
Date:
Subject: Re: pg_upgrade does not upgrade pg_stat_statements properly