Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound - Mailing list pgsql-hackers

From John Naylor
Subject Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
Date
Msg-id CAFBsxsH2DEB67gnrh272EF7pJonc29b8i8yRM7W3XqpyL6UpKQ@mail.gmail.com
Whole thread Raw
In response to Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound  (Aleksander Alekseev <aleksander@timescale.com>)
Responses Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound  (Aleksander Alekseev <aleksander@timescale.com>)
List pgsql-hackers
On Tue, Mar 21, 2023 at 6:44 PM Aleksander Alekseev <aleksander@timescale.com> wrote:

Okay, the changes look good. To go further, I think we need to combine into two patches, one with 0001-0003 and one with 0004:

1. Correct false statements about "shutdown" etc. This should contain changes that can safely be patched all the way to v11.
2. Change bad advice (single-user mode) into good advice. We can target head first, and then try to adopt as far back as we safely can (and test).

(...and future work, so not part of the CF here) 3. Tell the user what caused the problem, instead of saying "go figure it out yourself".

> > In swapping this topic back in my head, I also saw [2] where Robert suggested
> >
> > "that old prepared transactions and stale replication
> > slots should be emphasized more prominently.  Maybe something like:
> >
> > HINT:  Commit or roll back old prepared transactions, drop stale
> > replication slots, or kill long-running sessions.
> > Ensure that autovacuum is progressing, or run a manual database-wide VACUUM."
>
> It looks like the hint regarding replication slots was added at some
> point. Currently we have:
>
> ```
> errhint( [...]
>     "You might also need to commit or roll back old prepared
> transactions, or drop stale replication slots.")));
> ```

Yes, the exact same text as it appeared in the [2] thread above, which prompted Robert's comment I quoted. I take the point to mean: All of these things need to be taken care of *first*, before vacuuming, so the hint should order things so that it is clear.

> Please let me know if you think
> we should also add a suggestion to kill long-running sessions, etc.

+1 for also adding that.

- errmsg("database is not accepting commands to avoid wraparound data loss in database \"%s\"",
+ errmsg("database is not accepting commands that generate new XIDs to avoid wraparound data loss in database \"%s\"",

I'm not quite on board with the new message, but welcome additional opinions. For one, it's a bit longer and now ambiguous. I also bet that "generate XIDs" doesn't  really communicate anything useful. The people who understand exactly what that means, and what the consequences are, are unlikely to let the system get near wraparound in the first place, and might even know enough to ignore the hint.

I'm thinking we might need to convey something about "writes". While it's less technically correct, I imagine it's more useful. Remember, many users have it drilled in their heads that they need to drop immediately to single-user mode. I'd like to give some idea of what they can and cannot do.

+     Previously it was required to stop the postmaster and VACUUM the database
+     in a single-user mode. There is no need to use a single-user mode anymore.

I think we need to go further and actively warn against it: It's slow, impossible to monitor, disables replication and disables safeguards against wraparound. (Other bad things too, but these are easily understandable for users)

Maybe mention also that it's main use in wraparound situations is for a way to perform DROPs and TRUNCATEs if that would help speed up resolution.

I propose for discussion that 0004 should show in the docs all the queries for finding prepared xacts, repl slots etc. If we ever show the info at runtime, we can dispense with the queries, but there seems to be no urgency for that...

--
John Naylor
EDB: http://www.enterprisedb.com

pgsql-hackers by date:

Previous
From: Jeff Davis
Date:
Subject: Re: running logical replication as the subscription owner
Next
From: Andrey Borodin
Date:
Subject: Re: regression coverage gaps for gist and hash indexes