Thread: deferred writing of two-phase state files adds fragility

deferred writing of two-phase state files adds fragility

From: Robert Haas
Let's suppose that you execute PREPARE TRANSACTION and, before the
next CHECKPOINT, the WAL record for the PREPARE TRANSACTION gets
corrupted on disk. This might seem like an unlikely scenario, and it
is, but we saw a case at EDB not too long ago.
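
For concreteness, the setup is roughly this (table and GID names are
invented, and you need max_prepared_transactions > 0); the corruption
itself obviously isn't something you can script:

    CREATE TABLE IF NOT EXISTS t2pc (id int);

    BEGIN;
    INSERT INTO t2pc VALUES (1);
    PREPARE TRANSACTION 'demo-gid';

    -- Until the next checkpoint, there is no file under pg_twophase/ for
    -- this transaction; its state lives only in the PREPARE WAL record.
    SELECT gid, prepared, owner, database FROM pg_prepared_xacts;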

To a first approximation, the world ends. You can't execute COMMIT
TRANSACTION or ROLLBACK TRANSACTION, so there's no way to resolve the
prepared transaction. You also can't checkpoint, because that requires
writing a twophase state file for the prepared transaction, and that's
not possible because the WAL can't be read. What you have is a mostly
working system, except that it's going to bloat over time because the
prepared transaction is going to hold back the VACUUM horizon. And you
basically have no way out of that problem, because there's no tool
that says "I understand that my database is going to be corrupted,
that's ok, just forget about that twophase transaction".
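
You can at least see what's holding back the horizon, for whatever
consolation that is:

    -- The prepared transaction's xid age gives a rough sense of how far
    -- it is holding back the VACUUM horizon.
    SELECT gid, prepared, age(transaction) AS xid_age
    FROM pg_prepared_xacts
    ORDER BY age(transaction) DESC;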

If you shut down the database, then things become truly awful. You
can't get a clean shutdown because you can't checkpoint, so you're
going to resume recovery from the last checkpoint before the problem
happened, find the corrupted WAL, and fail. As long as your database
was up, you at least had the possibility of getting all of your data
out of it by running pg_dump, as long as you can survive the amount of
time that's going to take. And, if you did do that, you wouldn't even
have corruption. But once your database has gone down, you can't get
it back up again without running pg_resetwal. Running pg_resetwal is
not very appealing here -- first because now you do have corruption
whereas before the shutdown you didn't, and second because the last
checkpoint could already be a long time in the past, depending on how
quickly you realized you have this problem.

Before 728bd991c3c4389fb39c45dcb0fe57e4a1dccd71, things would not have
been quite so bad. Checkpoints wouldn't fail, so you might never even
realize you had a problem, or you might just need to rebuild your
standbys. If you had corruption in a different place, like the
twophase file itself, you could simply shut down cleanly, remove the
twophase file, and start back up. I'm not quite sure whether that's
equivalent to a forced abort of the twophase transaction or whether it
might leave you with some latent corruption, but I suspect the
problems you'll have will be pretty tame compared to what happens in
the scenario described above.

Just to be clear, I am not suggesting that we should revert that
commit. I'm actually not sure whether we should change anything at
all, but I'm not very comfortable with the status quo, either. It's
unavoidable that the database will sometimes end up in a bad state --
Murphy's law, entropy, or whatever you want to call it guarantees
that. But I like it a lot better when there's something that I can
reasonably do to get the database OUT of that bad state, and in this
situation nothing works -- or at least, nothing that I could think of
works. It would be nice to improve on that somehow, if anybody has a
good idea.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: deferred writing of two-phase state files adds fragility

From: Tom Lane
Robert Haas <robertmhaas@gmail.com> writes:
> Let's suppose that you execute PREPARE TRANSACTION and, before the
> next CHECKPOINT, the WAL record for the PREPARE TRANSACTION gets
> corrupted on disk. This might seem like an unlikely scenario, and it
> is, but we saw a case at EDB not too long ago.

> To a first approximation, the world ends.

Ugh.

> You can't execute COMMIT
> TRANSACTION or ROLLBACK TRANSACTION, so there's no way to resolve the
> prepared transaction.

Could we fix it so ROLLBACK TRANSACTION removes the GID from the
list of prepared xacts that need to be written out?  Then we'd
no longer have a pending requirement to read the broken WAL record.

> ... I'm not quite sure whether that's
> equivalent to a forced abort of the twophase transaction or whether it
> might leave you with some latent corruption, but I suspect the
> problems you'll have will be pretty tame compared to what happens in
> the scenario described above.

It should be fully equivalent to a database crash immediately before
committing an ordinary transaction.  (I know various people keep
proposing that we load UNDO requirements onto transaction abort,
but I continue to think that's an awful idea, for precisely this
reason: you can't guarantee being able to execute the UNDO steps.
In any case, we don't have such requirements yet.)

            regards, tom lane



Re: deferred writing of two-phase state files adds fragility

From: Robert Haas
On Wed, Dec 4, 2024 at 12:19 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > You can't execute COMMIT
> > TRANSACTION or ROLLBACK TRANSACTION, so there's no way to resolve the
> > prepared transaction.
>
> Could we fix it so ROLLBACK TRANSACTION removes the GID from the
> list of prepared xacts that need to be written out?  Then we'd
> no longer have a pending requirement to read the broken WAL record.

That would be nice, but I'm not sure that it's possible. As currently
implemented, FinishPreparedTransaction() always reads the two-phase
state data either from the two-phase file or the WAL, whether it's
committing or rolling back. One might expect the commit or rollback to
proceed purely on the basis of in-memory state, but I think that does
not work because nsubxacts might be greater than
PGPROC_MAX_CACHED_SUBXIDS. Even when it isn't, we have no shared
memory record of abortrels or abortstats.
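
To make the subxact point concrete, a sketch like this (table and GID
names invented) easily exceeds PGPROC_MAX_CACHED_SUBXIDS, currently 64,
because every plpgsql exception block that performs a write gets its own
subtransaction XID:

    CREATE TABLE IF NOT EXISTS subxid_demo (i int);

    BEGIN;
    DO $$
    BEGIN
      FOR i IN 1..200 LOOP
        BEGIN
          -- each write inside an EXCEPTION block assigns a new subxid
          INSERT INTO subxid_demo VALUES (i);
        EXCEPTION WHEN others THEN NULL;
        END;
      END LOOP;
    END $$;
    PREPARE TRANSACTION 'many-subxacts';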

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: deferred writing of two-phase state files adds fragility

From: Tom Lane
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Dec 4, 2024 at 12:19 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Could we fix it so ROLLBACK TRANSACTION removes the GID from the
>> list of prepared xacts that need to be written out?  Then we'd
>> no longer have a pending requirement to read the broken WAL record.

> That would be nice, but I'm not sure that it's possible. As currently
> implemented, FinishPreparedTransaction() always reads the two-phase
> state data either from the two-phase file or the WAL, whether it's
> committing or rolling back.

I'm not following.  FinishPreparedTransaction is not what's preventing
checkpoints or holding back the VACUUM horizon.  What is doing that
is the in-memory fake PGPROC representing the prepared transaction
(I forget the exact terminology).  I'm suggesting that we could have
some way to nuke one of those without properly cleaning up the
prepared xact.  Maybe it'd need to be invoked via a different command
than ROLLBACK TRANSACTION.

            regards, tom lane



Re: deferred writing of two-phase state files adds fragility

From: Robert Haas
On Wed, Dec 4, 2024 at 12:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Wed, Dec 4, 2024 at 12:19 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Could we fix it so ROLLBACK TRANSACTION removes the GID from the
> >> list of prepared xacts that need to be written out?  Then we'd
> >> no longer have a pending requirement to read the broken WAL record.
>
> > That would be nice, but I'm not sure that it's possible. As currently
> > implemented, FinishPreparedTransaction() always reads the two-phase
> > state data either from the two-phase file or the WAL, whether it's
> > committing or rolling back.
>
> I'm not following.  FinishPreparedTransaction is not what's preventing
> checkpoints or holding back the VACUUM horizon.  What is doing that
> is the in-memory fake PGPROC representing the prepared transaction
> (I forget the exact terminology).  I'm suggesting that we could have
> some way to nuke one of those without properly cleaning up the
> prepared xact.  Maybe it'd need to be invoked via a different command
> than ROLLBACK TRANSACTION.

Yes, that we could do. Perhaps it could be added to pg_surgery.
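
I'm imagining something along these lines -- to be clear, the function
name and signature below are invented, nothing like it exists today:

    -- Find the stuck prepared transaction.
    SELECT gid, prepared, age(transaction) AS xid_age
    FROM pg_prepared_xacts;

    -- Hypothetical escape hatch: discard the in-memory GlobalTransaction /
    -- dummy PGPROC without reading the unreadable PREPARE record, accepting
    -- whatever damage that implies.
    -- SELECT pg_surgery_forget_prepared_xact('stuck-gid');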

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: deferred writing of two-phase state files adds fragility

From: Andres Freund
Hi,

On 2024-12-04 12:04:47 -0500, Robert Haas wrote:
> Let's suppose that you execute PREPARE TRANSACTION and, before the
> next CHECKPOINT, the WAL record for the PREPARE TRANSACTION gets
> corrupted on disk. This might seem like an unlikely scenario, and it
> is, but we saw a case at EDB not too long ago.
> 
> To a first approximation, the world ends. You can't execute COMMIT
> TRANSACTION or ROLLBACK TRANSACTION, so there's no way to resolve the
> prepared transaction.

Is 2PC really that special in that regard? If the WAL that contains the
checkpoint record itself gets corrupted, you're also in a world of hurt, once
you shut down?  Or, to a slightly lower degree, if there's any corrupted
record between the redo pointer and the checkpoint record. And that's
obviously a lot more records than just the 2PC PREPARE/COMMIT records,
making the likelihood of some corruption higher.

The only reason it seems somewhat special is that it can more easily be
noticed while the server is running.


How did this corruption actually come about?  Did it really just affect
that single WAL segment? Somehow that doesn't seem too likely.


> You also can't checkpoint, because that requires
> writing a twophase state file for the prepared transaction, and that's
> not possible because the WAL can't be read. What you have is a mostly
> working system, except that it's going to bloat over time because the
> prepared transaction is going to hold back the VACUUM horizon. And you
> basically have no way out of that problem, because there's no tool
> that says "I understand that my database is going to be corrupted,
> that's ok, just forget about that twophase transaction".

> If you shut down the database, then things become truly awful. You
> can't get a clean shutdown because you can't checkpoint, so you're
> going to resume recovery from the last checkpoint before the problem
> happened, find the corrupted WAL, and fail. As long as your database
> was up, you at least had the possibility of getting all of your data
> out of it by running pg_dump, as long as you can survive the amount of
> time that's going to take. And, if you did do that, you wouldn't even
> have corruption. But once your database has gone down, you can't get
> it back up again without running pg_resetwal. Running pg_resetwal is
> not very appealing here -- first because now you do have corruption
> whereas before the shutdown you didn't, and second because the last
> checkpoint could already be a long time in the past, depending on how
> quickly you realized you have this problem.

pg_resetwal also won't actually remove the pg_twophase/* files if they did end
up getting created. But that's probably not too common a scenario.

Greetings,

Andres Freund



Re: deferred writing of two-phase state files adds fragility

From: Robert Haas
On Wed, Dec 4, 2024 at 6:36 PM Andres Freund <andres@anarazel.de> wrote:
> Is 2PC really that special in that regard? If the WAL that contains the
> checkpoint record itself gets corrupted, you're also in a world of hurt, once
> you shut down?  Or, to a slightly lower degree, if there's any corrupted
> record between the redo pointer and the checkpoint record. And that's
> obviously a lot more records than just the 2PC PREPARE/COMMIT records,
> making the likelihood of some corruption higher.

Sure, that's true. I think my point is just that in a lot of cases
where the WAL gets corrupted, you can eventually move on from the
problem. Let's say some bad hardware or some annoying "security"
software decides to overwrite the most recent CHECKPOINT record. If
you go down at that point, you're sad, but if you don't, the server
will eventually write a new checkpoint record and then the old, bad
one doesn't really matter any more. If you have standbys you may need
to rebuild them and if you need logical decoding you may need to
recreate subscriptions or something, but since you didn't really end
up needing the bad WAL, the fact that it happened doesn't have to
cripple the system in any enduring sense.
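
In that situation you can even confirm you've moved past the damaged
record once a fresh checkpoint goes through, e.g.:

    -- After a new checkpoint completes, the control file points at it and
    -- the old, corrupted record is no longer needed for crash recovery.
    CHECKPOINT;
    SELECT checkpoint_lsn, redo_lsn, checkpoint_time
    FROM pg_control_checkpoint();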

> The only reason it seems somewhat special is that it can more easily be
> noticed while the server is running.

I think there are two things that make it special. The first is that
this is nearly the only case where the primary has a critical
dependency on the WAL in the absence of a crash. The second is that,
AFAICT, there's no reasonable recovery strategy.

> How did this corruption actually come about?  Did it really just affect
> that single WAL segment? Somehow that doesn't seem too likely.

I don't know and might not be able to tell you even if I did.

> pg_resetwal also won't actually remove the pg_twophase/* files if they did end
> up getting created. But that's probably not too common a scenario.

Sure, but also, you can remove them yourself. IME, WAL corruption is
one of the worst case scenarios in terms of being able to get the
database back into reasonable shape. I can advise a customer to remove
an entire file if I need to; I have also written code to create fake
files to replace real ones that were lost; I have also written code to
fix broken heap pages. But when the problem is WAL, how are you
supposed to repair it? It's very difficult, I think, bordering on
impossible. Does anyone ever try to reconstruct a valid WAL stream to
allow replay to continue? AFAICT the only realistic solution is to run
pg_resetwal and hope that's good enough. That's often acceptable, but
it's not very nice in a case like this. Because you can't checkpoint,
you have no way to force the system to flush all dirty pages before
shutting it down, which means you may lose a bunch of data if you shut
down to run pg_resetwal. But if you don't shut down then you have no
way out of the bad state unless you can repair the WAL.
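
If it does come to that, you can at least get a rough idea of how much
unflushed data is at risk before pulling the trigger, assuming the
pg_buffercache extension is available:

    CREATE EXTENSION IF NOT EXISTS pg_buffercache;

    -- Dirty buffers that a working checkpoint would normally flush; with
    -- checkpoints broken, this is roughly what a shutdown-plus-pg_resetwal
    -- stands to lose.
    SELECT count(*) AS dirty_buffers FROM pg_buffercache WHERE isdirty;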

I don't think this is going to be a frequent case, so maybe it's not
worth doing anything about. But it does seem objectively worse than
most failure scenarios, at least to me.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: deferred writing of two-phase state files adds fragility

From: Andres Freund
Hi,

On 2024-12-05 11:21:12 -0500, Robert Haas wrote:
> On Wed, Dec 4, 2024 at 6:36 PM Andres Freund <andres@anarazel.de> wrote:
> > How did this corruption actually come about?  Did it really just affect
> > that single WAL segment? Somehow that doesn't seem too likely.
> 
> I don't know and might not be able to tell you even if I did.

Understandable. I was mainly asking because I don't really see it making sense
to make significant investments into coping with WAL getting arbitrarily
corrupted. But if there's some systematic issue leading to corruption, adding
more infrastructure to detect problems seems more worthwhile / realistic.

Greetings,

Andres Freund