Thread: deferred writing of two-phase state files adds fragility
Let's suppose that you execute PREPARE TRANSACTION and, before the next CHECKPOINT, the WAL record for the PREPARE TRANSACTION gets corrupted on disk. This might seem like an unlikely scenario, and it is, but we saw a case at EDB not too long ago.

To a first approximation, the world ends. You can't execute COMMIT PREPARED or ROLLBACK PREPARED, so there's no way to resolve the prepared transaction. You also can't checkpoint, because that requires writing a two-phase state file for the prepared transaction, and that's not possible because the WAL can't be read. What you have is a mostly working system, except that it's going to bloat over time because the prepared transaction is going to hold back the VACUUM horizon. And you basically have no way out of that problem, because there's no tool that says "I understand that my database is going to be corrupted, that's OK, just forget about that two-phase transaction".

If you shut down the database, then things become truly awful. You can't get a clean shutdown because you can't checkpoint, so you're going to resume recovery from the last checkpoint before the problem happened, find the corrupted WAL, and fail. As long as your database was up, you at least had the possibility of getting all of your data out of it by running pg_dump, provided you could survive the amount of time that would take. And if you did do that, you wouldn't even have corruption. But once your database has gone down, you can't get it back up again without running pg_resetwal. Running pg_resetwal is not very appealing here -- first because now you do have corruption whereas before the shutdown you didn't, and second because the last checkpoint could already be a long time in the past, depending on how quickly you realized you had this problem.

Before 728bd991c3c4389fb39c45dcb0fe57e4a1dccd71, things would not have been quite so bad. Checkpoints wouldn't fail, so you might never even realize you had a problem, or you might just need to rebuild your standbys. If you had corruption in a different place, like the two-phase state file itself, you could simply shut down cleanly, remove the two-phase file, and start back up. I'm not quite sure whether that's equivalent to a forced abort of the two-phase transaction or whether it might leave you with some latent corruption, but I suspect the problems you'll have will be pretty tame compared to what happens in the scenario described above.

Just to be clear, I am not suggesting that we should revert that commit. I'm actually not sure whether we should change anything at all, but I'm not very comfortable with the status quo, either. It's unavoidable that the database will sometimes end up in a bad state -- Murphy's law, entropy, or whatever you want to call it guarantees that. But I like it a lot better when there's something I can reasonably do to get the database OUT of that bad state, and in this situation nothing works -- or at least, nothing that I could think of works. It would be nice to improve on that somehow, if anybody has a good idea.

--
Robert Haas
EDB: http://www.enterprisedb.com
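A minimal sketch of the stuck state described above, assuming an invented scratch table called accounts and an invented GID 'txn_1'; the exact error text would depend on where the damaged record is hit:

    -- Prepare a transaction; until the next checkpoint its state exists only in WAL.
    BEGIN;
    INSERT INTO accounts VALUES (1, 100);        -- accounts is a made-up table
    PREPARE TRANSACTION 'txn_1';

    -- Now suppose the page holding the PREPARE record is damaged on disk.
    -- Everything that could resolve or persist the transaction needs that record:
    COMMIT PREPARED 'txn_1';     -- fails: the two-phase state must be re-read from WAL
    ROLLBACK PREPARED 'txn_1';   -- fails for the same reason
    CHECKPOINT;                  -- fails: it must copy that state into pg_twophase/

    -- Meanwhile the entry stays visible and its age keeps growing, holding
    -- back the VACUUM horizon:
    SELECT gid, prepared, age(transaction) FROM pg_prepared_xacts;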
Robert Haas <robertmhaas@gmail.com> writes:
> Let's suppose that you execute PREPARE TRANSACTION and, before the
> next CHECKPOINT, the WAL record for the PREPARE TRANSACTION gets
> corrupted on disk. This might seem like an unlikely scenario, and it
> is, but we saw a case at EDB not too long ago.

> To a first approximation, the world ends.

Ugh.

> You can't execute COMMIT PREPARED or ROLLBACK PREPARED, so there's no
> way to resolve the prepared transaction.

Could we fix it so ROLLBACK PREPARED removes the GID from the list of prepared xacts that need to be written out? Then we'd no longer have a pending requirement to read the broken WAL record.

> ... I'm not quite sure whether that's equivalent to a forced abort of
> the two-phase transaction or whether it might leave you with some
> latent corruption, but I suspect the problems you'll have will be
> pretty tame compared to what happens in the scenario described above.

It should be fully equivalent to a database crash immediately before committing an ordinary transaction. (I know various people keep proposing that we load UNDO requirements onto transaction abort, but I continue to think that's an awful idea, for precisely this reason: you can't guarantee being able to execute the UNDO steps. In any case, we don't have such requirements yet.)

			regards, tom lane
On Wed, Dec 4, 2024 at 12:19 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > You can't execute COMMIT PREPARED or ROLLBACK PREPARED, so there's no
> > way to resolve the prepared transaction.
>
> Could we fix it so ROLLBACK PREPARED removes the GID from the
> list of prepared xacts that need to be written out? Then we'd
> no longer have a pending requirement to read the broken WAL record.

That would be nice, but I'm not sure that it's possible. As currently implemented, FinishPreparedTransaction() always reads the two-phase state data either from the two-phase file or the WAL, whether it's committing or rolling back. One might expect the commit or rollback to proceed purely on the basis of in-memory state, but I think that does not work, because nsubxacts might be greater than PGPROC_MAX_CACHED_SUBXIDS. Even when it isn't, we have no shared-memory record of abortrels or abortstats.

--
Robert Haas
EDB: http://www.enterprisedb.com
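For illustration, here is one way to build a prepared transaction whose subtransaction count exceeds the per-backend subxid cache (PGPROC_MAX_CACHED_SUBXIDS, 64 at the time of writing); the table name and GID are invented. Each plpgsql block with an EXCEPTION clause opens a subtransaction, and the INSERT forces an XID to be assigned to it:

    CREATE TABLE IF NOT EXISTS twophase_demo (i int);   -- made-up scratch table

    BEGIN;
    DO $$
    BEGIN
      FOR i IN 1..100 LOOP
        BEGIN
          INSERT INTO twophase_demo VALUES (i);   -- assigns an XID to the subxact
        EXCEPTION WHEN OTHERS THEN
          NULL;                                   -- the block itself is the subtransaction
        END;
      END LOOP;
    END $$;
    PREPARE TRANSACTION 'many_subxacts';

    -- With ~100 subxids, the dummy PGPROC's subxid cache overflows, so the
    -- full list lives only in the two-phase state data written at PREPARE
    -- time, which is why commit/rollback can't rely on in-memory state alone.
    ROLLBACK PREPARED 'many_subxacts';            -- clean up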
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Dec 4, 2024 at 12:19 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Could we fix it so ROLLBACK PREPARED removes the GID from the
>> list of prepared xacts that need to be written out? Then we'd
>> no longer have a pending requirement to read the broken WAL record.

> That would be nice, but I'm not sure that it's possible. As currently
> implemented, FinishPreparedTransaction() always reads the two-phase
> state data either from the two-phase file or the WAL, whether it's
> committing or rolling back.

I'm not following. FinishPreparedTransaction is not what's preventing checkpoints or holding back the VACUUM horizon. What is doing that is the in-memory fake PGPROC representing the prepared transaction (I forget the exact terminology). I'm suggesting that we could have some way to nuke one of those without properly cleaning up the prepared xact. Maybe it'd need to be invoked via a different command than ROLLBACK PREPARED.

			regards, tom lane
On Wed, Dec 4, 2024 at 12:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Wed, Dec 4, 2024 at 12:19 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Could we fix it so ROLLBACK PREPARED removes the GID from the
> >> list of prepared xacts that need to be written out? Then we'd
> >> no longer have a pending requirement to read the broken WAL record.
>
> > That would be nice, but I'm not sure that it's possible. As currently
> > implemented, FinishPreparedTransaction() always reads the two-phase
> > state data either from the two-phase file or the WAL, whether it's
> > committing or rolling back.
>
> I'm not following. FinishPreparedTransaction is not what's preventing
> checkpoints or holding back the VACUUM horizon. What is doing that
> is the in-memory fake PGPROC representing the prepared transaction
> (I forget the exact terminology). I'm suggesting that we could have
> some way to nuke one of those without properly cleaning up the
> prepared xact. Maybe it'd need to be invoked via a different command
> than ROLLBACK PREPARED.

Yes, that we could do. Perhaps it could be added to pg_surgery.

--
Robert Haas
EDB: http://www.enterprisedb.com
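From the user's side, the escape hatch being discussed could hypothetically look something like the sketch below. pg_surgery is a real contrib extension, but no such function exists today; the name and behavior are invented purely to illustrate the idea:

    CREATE EXTENSION pg_surgery;

    -- Hypothetical, non-existent function: forcibly discard the in-memory entry
    -- for a prepared transaction, accepting that its changes are left unresolved.
    SELECT pg_surgery_forget_prepared_xact('txn_1');

    -- Afterwards the entry would no longer pin the VACUUM horizon or require a
    -- two-phase state file at the next checkpoint.
    SELECT gid FROM pg_prepared_xacts;   -- 'txn_1' would be gone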
Hi,

On 2024-12-04 12:04:47 -0500, Robert Haas wrote:
> Let's suppose that you execute PREPARE TRANSACTION and, before the
> next CHECKPOINT, the WAL record for the PREPARE TRANSACTION gets
> corrupted on disk. This might seem like an unlikely scenario, and it
> is, but we saw a case at EDB not too long ago.
>
> To a first approximation, the world ends. You can't execute COMMIT
> PREPARED or ROLLBACK PREPARED, so there's no way to resolve the
> prepared transaction.

Is 2PC really that special in that regard? If the WAL that contains the checkpoint record itself gets corrupted, you're also in a world of hurt once you shut down. Or, to a slightly lesser degree, if there's any corrupted record between the redo pointer and the checkpoint record. And that's obviously a lot more records than just the 2PC PREPARE and COMMIT records, making the likelihood of some corruption higher.

The only reason it seems somewhat special is that it can more easily be noticed while the server is running.

How did this corruption actually come about? Did it really just affect that single WAL segment? Somehow that doesn't seem too likely.

> You also can't checkpoint, because that requires writing a two-phase
> state file for the prepared transaction, and that's not possible
> because the WAL can't be read. What you have is a mostly working
> system, except that it's going to bloat over time because the prepared
> transaction is going to hold back the VACUUM horizon. And you
> basically have no way out of that problem, because there's no tool
> that says "I understand that my database is going to be corrupted,
> that's OK, just forget about that two-phase transaction".
>
> If you shut down the database, then things become truly awful. You
> can't get a clean shutdown because you can't checkpoint, so you're
> going to resume recovery from the last checkpoint before the problem
> happened, find the corrupted WAL, and fail. As long as your database
> was up, you at least had the possibility of getting all of your data
> out of it by running pg_dump, provided you could survive the amount of
> time that would take. And if you did do that, you wouldn't even have
> corruption. But once your database has gone down, you can't get it
> back up again without running pg_resetwal. Running pg_resetwal is not
> very appealing here -- first because now you do have corruption
> whereas before the shutdown you didn't, and second because the last
> checkpoint could already be a long time in the past, depending on how
> quickly you realized you had this problem.

pg_resetwal also won't actually remove the pg_twophase/* files if they did end up getting created. But that's probably not a very common scenario.

Greetings,

Andres Freund
On Wed, Dec 4, 2024 at 6:36 PM Andres Freund <andres@anarazel.de> wrote:
> Is 2PC really that special in that regard? If the WAL that contains the
> checkpoint record itself gets corrupted, you're also in a world of hurt
> once you shut down. Or, to a slightly lesser degree, if there's any
> corrupted record between the redo pointer and the checkpoint record.
> And that's obviously a lot more records than just the 2PC PREPARE and
> COMMIT records, making the likelihood of some corruption higher.

Sure, that's true. I think my point is just that in a lot of cases where the WAL gets corrupted, you can eventually move on from the problem. Let's say some bad hardware or some annoying "security" software decides to overwrite the most recent CHECKPOINT record. If you go down at that point, you're sad, but if you don't, the server will eventually write a new checkpoint record, and then the old, bad one doesn't really matter any more. If you have standbys you may need to rebuild them, and if you need logical decoding you may need to recreate subscriptions or something, but since you didn't really end up needing the bad WAL, the fact that it happened doesn't have to cripple the system in any enduring sense.

> The only reason it seems somewhat special is that it can more easily be
> noticed while the server is running.

I think there are two things that make it special. The first is that this is nearly the only case where the primary has a critical dependency on the WAL in the absence of a crash. The second is that, AFAICT, there's no reasonable recovery strategy.

> How did this corruption actually come about? Did it really just affect
> that single WAL segment? Somehow that doesn't seem too likely.

I don't know, and might not be able to tell you even if I did.

> pg_resetwal also won't actually remove the pg_twophase/* files if they
> did end up getting created. But that's probably not a very common
> scenario.

Sure, but also, you can remove them yourself. IME, WAL corruption is one of the worst-case scenarios in terms of being able to get the database back into reasonable shape. I can advise a customer to remove an entire file if I need to; I have also written code to create fake files to replace real ones that were lost; I have also written code to fix broken heap pages. But when the problem is WAL, how are you supposed to repair it? It's very difficult, I think, bordering on impossible. Does anyone ever try to reconstruct a valid WAL stream to allow replay to continue? AFAICT the only realistic solution is to run pg_resetwal and hope that's good enough. That's often acceptable, but it's not very nice in a case like this. Because you can't checkpoint, you have no way to force the system to flush all dirty pages before shutting it down, which means you may lose a bunch of data if you shut down to run pg_resetwal. But if you don't shut down, then you have no way out of the bad state unless you can repair the WAL.

I don't think this is going to be a frequent case, so maybe it's not worth doing anything about. But it does seem objectively worse than most failure scenarios, at least to me.

--
Robert Haas
EDB: http://www.enterprisedb.com
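One way an operator might at least notice this situation while the server is still up -- a rough monitoring sketch, using the standard pg_control_checkpoint() function and pg_prepared_xacts view -- is to watch whether the last completed checkpoint keeps falling further behind the oldest prepared transaction:

    -- A last_checkpoint that stays older than oldest_prepare, with checkpoint_age
    -- growing without bound, suggests checkpoints are failing.
    SELECT c.checkpoint_time           AS last_checkpoint,
           min(p.prepared)             AS oldest_prepare,
           now() - c.checkpoint_time   AS checkpoint_age
    FROM pg_control_checkpoint() AS c
    CROSS JOIN pg_prepared_xacts AS p
    GROUP BY c.checkpoint_time;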
Hi,

On 2024-12-05 11:21:12 -0500, Robert Haas wrote:
> On Wed, Dec 4, 2024 at 6:36 PM Andres Freund <andres@anarazel.de> wrote:
> > How did this corruption actually come about? Did it really just affect
> > that single WAL segment? Somehow that doesn't seem too likely.
>
> I don't know, and might not be able to tell you even if I did.

Understandable. I was mainly asking because I don't really see it making sense to make significant investments into coping with WAL getting arbitrarily corrupted. But if there's some systematic issue leading to corruption, adding more infrastructure to detect problems seems more worthwhile and realistic.

Greetings,

Andres Freund