Re: data corruption hazard in reorderbuffer.c - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: data corruption hazard in reorderbuffer.c
Msg-id: 7d1ccf60-8a86-76b2-2831-d2b8a4554007@enterprisedb.com
In response to: Re: data corruption hazard in reorderbuffer.c (Mark Dilger <mark.dilger@enterprisedb.com>)
List: pgsql-hackers
On 7/16/21 1:07 AM, Mark Dilger wrote:
>
>> On Jul 15, 2021, at 3:32 PM, Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>
>> I think it's mostly futile to list all the possible issues this
>> might have caused - if you skip arbitrary decoded changes, that can
>> trigger pretty much any bug in reorder buffer. But those bugs can
>> be triggered by various other issues, of course.
>
> I thought that at first, too, but when I searched for bug reports
> with the given error messages, they all had to do with logical
> replication or logical decoding. That seems a bit fishy to me. If
> these bugs were coming from all over the system, why would that be
> so?
>

No, I'm not suggesting those bugs are coming from all over the system.
The theory is that there's a bug in logical decoding / reorder buffer
with the same symptoms, so it'd affect systems with logical replication.

>> It's hard to say what was the cause, but the "logic" bugs are
>> probably permanent, while the issues triggered by I/O probably
>> disappear after a restart?
>
> If you mean "logic" bugs like passing an invalid file descriptor to
> close(), then yes, those would be permanent, but I don't believe we
> have any bugs of that kind.
>

No, by logic bugs I mean cases like failing to update the snapshot,
losing track of relfilenode changes, etc., due to failing to consider
some corner case - as opposed to "I/O errors", which are caused by
external events.

> The consequences of I/O errors are not going to go away after
> restart. On the subscriber side, if logical replication has replayed
> less than a full transaction worth of changes without raising an
> error, the corruption will be permanent.
>

True, good point. I was thinking about the "could not map relfilenode"
error, which forces the decoding to restart; on the retry it's unlikely
to hit the same I/O error, so it'll succeed. But you're right that if a
broken change reaches the subscriber and gets committed (e.g. a missing
row), the damage is permanent.
>> That being said, I agree this seems like an issue and we should not
>> ignore I/O errors.
>
> I agree.
>
>> I'd bet other places using transient files (like sorting or hashagg
>> spilling to disk) have the same issue, although in that case the
>> impact is likely limited to a single query.
>
> I'm not so convinced. In most places, the result is ignored for
> close()-type operations only when the prior operation failed and
> we're closing the handle in preparation for raising an error. There
> are a small number of other places, such as
> pg_import_system_collations and perform_base_backup, which ignore the
> result from closing a handle, but those are reading from the handle,
> not writing to it, so the situation is not comparable.
>
> I believe the oversight in reorderbuffer.c really is a special case.
>

Hmm, you might be right. I only briefly looked at buffile.c and
tuplestore.c yesterday, and I wasn't sure about the error checking. But
maybe that works fine, now that I look at it, because we don't really
use the files after close().

>> I wonder if a sync before the close is an appropriate solution,
>> though. It seems rather expensive, and those files are meant to be
>> "temporary" (i.e. we don't keep them over restart). So maybe we
>> could ensure the consistency in a cheaper way - perhaps by tracking
>> some sort of checksum for each file, or something like that?
>
> I'm open to suggestions of what to do, but thinking of these files as
> temporary may be what misleads developers into believing they don't
> have to be treated carefully. The code was first committed in 2014,
> and as far as I am aware nobody else has complained about this
> before. In some part that might be because CloseTemporaryFile() is
> less familiar than close() to most developers, and they may be
> assuming that it contains its own error handling, not realizing that
> it returns an error code just like regular close() does.
>

I don't think anyone thinks corruption of temporary files would not be
an issue, but making them fully persistent (which is what fsync would
do) just to ensure a piece of data is not lost seems like overkill to
me. And the file may get corrupted even with fsync(), so I'm wondering
if there's a way to ensure integrity without the fsync (i.e. cheaper).

The scheme I was thinking about is a simple checksum for each BufFile:
when writing a buffer to disk, update a crc32c checksum, and then check
it after reading the file (and error out if it disagrees).

> The point of the reorder buffer is to sort the changes from a single
> transaction so that they can be replayed in order. The files used
> for the sorting are temporary, but there is nothing temporary about
> the failure to include all the changes in the files, as that will
> have permanent consequences for whoever replays them. If lucky, the
> attempt to replay the changes will abort because they don't make
> sense, but I've demonstrated to my own satisfaction that nothing
> prevents silent data loss if the failure to write changes happens to
> destroy a complete rather than partial change.
>

Sure, I agree with all of that.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company