Re: deferred writing of two-phase state files adds fragility - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: deferred writing of two-phase state files adds fragility
Msg-id: CA+TgmoZXtYoybG2Rj5CAUe9hMBBPjx-qRKU8VDK8OU6vs0uEtw@mail.gmail.com
In response to: Re: deferred writing of two-phase state files adds fragility (Andres Freund <andres@anarazel.de>)
Responses: Re: deferred writing of two-phase state files adds fragility
List: pgsql-hackers
On Wed, Dec 4, 2024 at 6:36 PM Andres Freund <andres@anarazel.de> wrote:
> Is 2PC really that special in that regard? If the WAL that contains the
> checkpoint record itself gets corrupted, you're also in a world of hurt, once
> you shut down? Or, to a slightly lower degree, if there's any corrupted
> record between the redo pointer and the checkpoint record. And that's
> obviously a lot more records than just 2PC COMMIT/RECORD, making the
> likelihood of some corruption higher.

Sure, that's true. I think my point is just that in a lot of cases where the WAL gets corrupted, you can eventually move on from the problem. Let's say some bad hardware or some annoying "security" software decides to overwrite the most recent CHECKPOINT record. If you go down at that point, you're sad, but if you don't, the server will eventually write a new checkpoint record and then the old, bad one doesn't really matter any more. If you have standbys you may need to rebuild them, and if you need logical decoding you may need to recreate subscriptions or something, but since you didn't really end up needing the bad WAL, the fact that it happened doesn't have to cripple the system in any enduring sense.

> The only reason it seems somewhat special is that it can more easily be
> noticed while the server is running.

I think there are two things that make it special. The first is that this is nearly the only case where the primary has a critical dependency on the WAL in the absence of a crash. The second is that, AFAICT, there's no reasonable recovery strategy.

> How did this corruption actually come about? Did it actually really just
> affect that single WAL segment? Somehow that doesn't seem too likely.

I don't know and might not be able to tell you even if I did.

> pg_resetwal also won't actually remove the pg_twophase/* files if they did end
> up getting created. But that's probably not a too common scenario.

Sure, but also, you can remove them yourself.
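To make the cleanup step concrete, here is a minimal sketch of what "remove them yourself" looks like. This is illustrative only: it mocks up a throwaway data directory (the paths are hypothetical) rather than touching a real cluster, and the pg_resetwal invocation is shown commented out because it is destructive and must only ever run against a cleanly stopped server.

```shell
# Simulate a data directory containing an orphaned two-phase state file,
# since pg_resetwal does not clean up pg_twophase/* on its own.
PGDATA=$(mktemp -d)
mkdir -p "$PGDATA/pg_twophase"
touch "$PGDATA/pg_twophase/00000123"   # stand-in for a leftover 2PC state file

# On a real, cleanly stopped cluster one would first reset the WAL
# (last resort, discards unreplayed changes):
#   pg_resetwal -f "$PGDATA"

# ...and then remove any orphaned two-phase state files left behind:
rm -f "$PGDATA"/pg_twophase/*
```

After this, pg_twophase/ is empty and the server will not try to recover phantom prepared transactions at the next start.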
IME, WAL corruption is one of the worst-case scenarios in terms of being able to get the database back into reasonable shape. I can advise a customer to remove an entire file if I need to; I have also written code to create fake files to replace real ones that were lost; I have also written code to fix broken heap pages. But when the problem is WAL, how are you supposed to repair it? It's very difficult, I think, bordering on impossible. Does anyone ever try to reconstruct a valid WAL stream to allow replay to continue? AFAICT the only realistic solution is to run pg_resetwal and hope that's good enough.

That's often acceptable, but it's not very nice in a case like this. Because you can't checkpoint, you have no way to force the system to flush all dirty pages before shutting it down, which means you may lose a bunch of data if you shut down to run pg_resetwal. But if you don't shut down, then you have no way out of the bad state unless you can repair the WAL.

I don't think this is going to be a frequent case, so maybe it's not worth doing anything about. But it does seem objectively worse than most failure scenarios, at least to me.

--
Robert Haas
EDB: http://www.enterprisedb.com