Thread: [PATCH] Clarify the behavior of the system when approaching XID wraparound
[PATCH] Clarify the behavior of the system when approaching XID wraparound
From
Aleksander Alekseev
Date:
Hi hackers, While playing with 64-bit XIDs [1] my attention was drawn by the following statement in the docs [2]: """ If these warnings are ignored, the system will shut down and refuse to start any new transactions once there are fewer than three million transactions left until wraparound. """ I decided to check this. Unfortunately it can't be done easily e.g. by modifying ShmemVariableCache->nextXid in gdb, because the system will PANIC with something like "could not access status of transaction 12345". Hopefully [3] will change the situation someday. Meanwhile I choose the hard way. In one session I did: ``` CREATE TABLE phonebook( "id" SERIAL PRIMARY KEY NOT NULL, "name" NAME NOT NULL, "phone" INT NOT NULL); BEGIN; INSERT INTO phonebook VALUES (1, 'Alex', 123); -- don't commit! ``` Then I did the following: ``` echo "SELECT pg_current_xact_id();" > t.sql pgbench -j 8 -c 8 -f t.sql -T 86400 eax ``` After 20-24 hours on the typical hardware (perhaps faster if only I didn't forget to use `synchronous_commit = off`) pgbench will use up the XID pool. The old tuples can't be frozen because the transaction we created in the beginning is still in progress. So now we can observe what actually happens when the system reaches xidStopLimit. Firstly, the system doesn't shutdown as the documentation says. Secondly, it executes new transactions just fine as long as these transactions don't allocate new XIDs. XIDs are allocated not for every transaction but rather lazily, when needed (see backend_xid in pg_stat_activity). A transaction doesn't need an assigned XID for checking the visibility of the tuples. Rather it uses xmin horizon, and only when using an isolation level above READ COMMITTED, see backend_xmin in pg_stat_activity. Assigning a xmin horizon doesn't increase nextXid. As a result, PostgreSQL can still execute read-only transactions even after reaching xidStopLimit. Similarly to how it can do this on hot standby replicas without having conflicts with the leader server. Thirdly, if there was a transaction created before reaching xidStopLimit, it will continue to execute after reaching xidStopLimit, and it can be successfully committed. All in all, the actual behavior is far from "system shutdown" and "refusing to start any new transactions". It's closer to entering read-only mode, similarly to what hot standbys allow to do. The proposed patchset changes the documentation and the error messages accordingly, making them less misleading. 0001 corrects the documentation but doesn't touch the code. 0002 and 0003 correct the messages shown when approaching xidWrapLimit and xidWarnLimit accordingly. Thoughts? [1]: https://commitfest.postgresql.org/41/3594/ [2]: https://www.postgresql.org/docs/current/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND [3]: https://commitfest.postgresql.org/41/3729/ -- Best regards, Aleksander Alekseev
Attachment
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
From
Aleksander Alekseev
Date:
Hi hackers, > The proposed patchset changes the documentation and the error messages > accordingly, making them less misleading. 0001 corrects the > documentation but doesn't touch the code. 0002 and 0003 correct the > messages shown when approaching xidWrapLimit and xidWarnLimit > accordingly. A colleague of mine, Oleksii Kliukin, pointed out that the recommendation about running VACUUM in a single-user mode is also outdated, as it was previously reported in [1]. I didn't believe it at first and decided to double-check: ``` =# select * from phonebook; id | name | phone ----+---------+------- 1 | Alex | 123 5 | Charlie | 789 2 | Bob | 456 6 | Ololo | 789 (4 rows) =# insert into phonebook values (7, 'Trololo', 987); ERROR: database is not accepting commands to avoid wraparound data loss in database "template1" HINT: Stop the postmaster and vacuum that database in single-user mode. You might also need to commit or roll back old prepared transactions, or drop stale replication slots. =# VACUUM FREEZE; VACUUM =# insert into phonebook values (7, 'Trololo', 987); INSERT 0 1 =# SELECT current_setting('wal_level'); current_setting ----------------- logical ``` Unfortunately the [1] discussion went nowhere. So I figured it would be appropriate to add corresponding changes to the proposed patchset since it's relevant and is registered in the CF app already. PFA patchset v2 which now also includes 0004. [1]: https://www.postgresql.org/message-id/flat/CAMT0RQTmRj_Egtmre6fbiMA9E2hM3BsLULiV8W00stwa3URvzA%40mail.gmail.com -- Best regards, Aleksander Alekseev
Attachment
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound (stop telling users to "vacuum that database in single-user mode")
From
Justin Pryzby
Date:
On Mon, Jan 16, 2023 at 03:50:57PM +0300, Aleksander Alekseev wrote: > Hi hackers, > > > The proposed patchset changes the documentation and the error messages > > accordingly, making them less misleading. 0001 corrects the > > documentation but doesn't touch the code. 0002 and 0003 correct the > > messages shown when approaching xidWrapLimit and xidWarnLimit > > accordingly. > > A colleague of mine, Oleksii Kliukin, pointed out that the > recommendation about running VACUUM in a single-user mode is also > outdated, as it was previously reported in [1]. I didn't believe it at > first and decided to double-check: and again at: https://www.postgresql.org/message-id/flat/CA%2BTgmoYPfofQmRtUan%3DA3aWE9wFsJaOFr%2BW_ys2pPkNPr-2FZw%40mail.gmail.com#e7dd25fdcd171c5775f3f9e3f86b2082 > Unfortunately the [1] discussion went nowhere. likewise... > So I figured it would be appropriate to add corresponding changes to > the proposed patchset since it's relevant and is registered in the CF > app already. PFA patchset v2 which now also includes 0004. > > [1]: > https://www.postgresql.org/message-id/flat/CAMT0RQTmRj_Egtmre6fbiMA9E2hM3BsLULiV8W00stwa3URvzA%40mail.gmail.com I suggest to resend this with a title like the 2021 thread [1] (I was unable to find this just now when I looked) | doc: stop telling users to "vacuum that database in single-user mode" And copy the participants of the previous two iterations of this thread. -- Justin
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
From
John Naylor
Date:
Thanks for picking up this badly-needed topic again! I was irresponsible last year and let it fall off my radar, but I'm looking at the patches, as well as revisiting discussions from the last four (!?) years that didn't lead to action.
0001:
+ In this condition the system can still execute read-only transactions.
+ The active transactions will continue to execute and will be able to
+ commit.
This is ambiguous. I'd first say that any transactions already started can continue, and then say that only new read-only transactions can be started.
0004:
-HINT: Stop the postmaster and vacuum that database in single-user mode.
+HINT: VACUUM or VACUUM FREEZE that database.
VACUUM FREEZE is worse and should not be mentioned, since it does unnecessary work. Emergency vacuum is not school -- you don't get extra credit for doing unnecessary work.
Also, we may consider adding a boxed NOTE warning specifically against single-user mode, especially if this recommendation will change in at least some minor releases so people may not hear about it. See also [1].
- * If we're past xidStopLimit, refuse to execute transactions, unless
- * we are running in single-user mode (which gives an escape hatch
- * to the DBA who somehow got past the earlier defenses).
+ * If we're past xidStopLimit, refuse to allocate new XIDs.
This patch doesn't completely get rid of the need for single-user mode, so it should keep all information about it. If a DBA wanted to e.g. drop or truncate a table to save vacuum time, it is still possible to do it in single-user mode, so the escape hatch is still useful.
In swapping this topic back in my head, I also saw [2] where Robert suggested
"that old prepared transactions and stale replication
slots should be emphasized more prominently. Maybe something like:
HINT: Commit or roll back old prepared transactions, drop stale
replication slots, or kill long-running sessions.
Ensure that autovacuum is progressing, or run a manual database-wide VACUUM."
That sounds like a good direction to me. There is more we could do here to make the message more specific [3][4][5], but the patches here are in the right direction.
Note for possible backpatching: It seems straightforward to go back to PG14, which has the failsafe, but we should have better testing in place first. There is a patch in this CF to make it easier to get close to wraparound, so I'll look at what it does as well.
[1] https://www.postgresql.org/message-id/CA%2BTgmoadjx%2Br8_gGbbnNifL6vEyjZntiQRPzyixrUihvtZ5jdQ%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CA+Tgmob1QCMJrHwRBK8HZtGsr+6cJANRQw2mEgJ9e=D+z7cOsw@mail.gmail.com
[3] https://www.postgresql.org/message-id/20190504023015.5mgpbl27tld4irw5%40alap3.anarazel.de
[4] https://www.postgresql.org/message-id/20220204013539.qdegpqzvayq3d4y2%40alap3.anarazel.de
[5] https://www.postgresql.org/message-id/20220220045757.GA3733812%40rfd.leadboat.com
0001:
+ In this condition the system can still execute read-only transactions.
+ The active transactions will continue to execute and will be able to
+ commit.
This is ambiguous. I'd first say that any transactions already started can continue, and then say that only new read-only transactions can be started.
0004:
-HINT: Stop the postmaster and vacuum that database in single-user mode.
+HINT: VACUUM or VACUUM FREEZE that database.
VACUUM FREEZE is worse and should not be mentioned, since it does unnecessary work. Emergency vacuum is not school -- you don't get extra credit for doing unnecessary work.
Also, we may consider adding a boxed NOTE warning specifically against single-user mode, especially if this recommendation will change in at least some minor releases so people may not hear about it. See also [1].
- * If we're past xidStopLimit, refuse to execute transactions, unless
- * we are running in single-user mode (which gives an escape hatch
- * to the DBA who somehow got past the earlier defenses).
+ * If we're past xidStopLimit, refuse to allocate new XIDs.
This patch doesn't completely get rid of the need for single-user mode, so it should keep all information about it. If a DBA wanted to e.g. drop or truncate a table to save vacuum time, it is still possible to do it in single-user mode, so the escape hatch is still useful.
In swapping this topic back in my head, I also saw [2] where Robert suggested
"that old prepared transactions and stale replication
slots should be emphasized more prominently. Maybe something like:
HINT: Commit or roll back old prepared transactions, drop stale
replication slots, or kill long-running sessions.
Ensure that autovacuum is progressing, or run a manual database-wide VACUUM."
That sounds like a good direction to me. There is more we could do here to make the message more specific [3][4][5], but the patches here are in the right direction.
Note for possible backpatching: It seems straightforward to go back to PG14, which has the failsafe, but we should have better testing in place first. There is a patch in this CF to make it easier to get close to wraparound, so I'll look at what it does as well.
[1] https://www.postgresql.org/message-id/CA%2BTgmoadjx%2Br8_gGbbnNifL6vEyjZntiQRPzyixrUihvtZ5jdQ%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CA+Tgmob1QCMJrHwRBK8HZtGsr+6cJANRQw2mEgJ9e=D+z7cOsw@mail.gmail.com
[3] https://www.postgresql.org/message-id/20190504023015.5mgpbl27tld4irw5%40alap3.anarazel.de
[4] https://www.postgresql.org/message-id/20220204013539.qdegpqzvayq3d4y2%40alap3.anarazel.de
[5] https://www.postgresql.org/message-id/20220220045757.GA3733812%40rfd.leadboat.com
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
From
Aleksander Alekseev
Date:
Hi John, > Thanks for picking up this badly-needed topic again! Many thanks for the review! > 0001: > > + In this condition the system can still execute read-only transactions. > + The active transactions will continue to execute and will be able to > + commit. > > This is ambiguous. I'd first say that any transactions already started can continue, and then say that only new read-onlytransactions can be started. Fixed. > 0004: > > -HINT: Stop the postmaster and vacuum that database in single-user mode. > +HINT: VACUUM or VACUUM FREEZE that database. > > VACUUM FREEZE is worse and should not be mentioned, since it does unnecessary work. Emergency vacuum is not school -- youdon't get extra credit for doing unnecessary work. Fixed. > Also, we may consider adding a boxed NOTE warning specifically against single-user mode, especially if this recommendationwill change in at least some minor releases so people may not hear about it. See also [1]. Done. > - * If we're past xidStopLimit, refuse to execute transactions, unless > - * we are running in single-user mode (which gives an escape hatch > - * to the DBA who somehow got past the earlier defenses). > + * If we're past xidStopLimit, refuse to allocate new XIDs. > > This patch doesn't completely get rid of the need for single-user mode, so it should keep all information about it. Ifa DBA wanted to e.g. drop or truncate a table to save vacuum time, it is still possible to do it in single-user mode, sothe escape hatch is still useful. Fixed. > In swapping this topic back in my head, I also saw [2] where Robert suggested > > "that old prepared transactions and stale replication > slots should be emphasized more prominently. Maybe something like: > > HINT: Commit or roll back old prepared transactions, drop stale > replication slots, or kill long-running sessions. > Ensure that autovacuum is progressing, or run a manual database-wide VACUUM." It looks like the hint regarding replication slots was added at some point. Currently we have: ``` errhint( [...] "You might also need to commit or roll back old prepared transactions, or drop stale replication slots."))); ``` So I choose to keep it as is for now. Please let me know if you think we should also add a suggestion to kill long-running sessions, etc. -- Best regards, Aleksander Alekseev
Attachment
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
From
John Naylor
Date:
On Tue, Mar 21, 2023 at 6:44 PM Aleksander Alekseev <aleksander@timescale.com> wrote:
Okay, the changes look good. To go further, I think we need to combine into two patches, one with 0001-0003 and one with 0004:
1. Correct false statements about "shutdown" etc. This should contain changes that can safely be patched all the way to v11.
2. Change bad advice (single-user mode) into good advice. We can target head first, and then try to adopt as far back as we safely can (and test).
(...and future work, so not part of the CF here) 3. Tell the user what caused the problem, instead of saying "go figure it out yourself".
> > In swapping this topic back in my head, I also saw [2] where Robert suggested
> >
> > "that old prepared transactions and stale replication
> > slots should be emphasized more prominently. Maybe something like:
> >
> > HINT: Commit or roll back old prepared transactions, drop stale
> > replication slots, or kill long-running sessions.
> > Ensure that autovacuum is progressing, or run a manual database-wide VACUUM."
>
> It looks like the hint regarding replication slots was added at some
> point. Currently we have:
>
> ```
> errhint( [...]
> "You might also need to commit or roll back old prepared
> transactions, or drop stale replication slots.")));
> ```
Yes, the exact same text as it appeared in the [2] thread above, which prompted Robert's comment I quoted. I take the point to mean: All of these things need to be taken care of *first*, before vacuuming, so the hint should order things so that it is clear.
> Please let me know if you think
> we should also add a suggestion to kill long-running sessions, etc.
+1 for also adding that.
- errmsg("database is not accepting commands to avoid wraparound data loss in database \"%s\"",
+ errmsg("database is not accepting commands that generate new XIDs to avoid wraparound data loss in database \"%s\"",
I'm not quite on board with the new message, but welcome additional opinions. For one, it's a bit longer and now ambiguous. I also bet that "generate XIDs" doesn't really communicate anything useful. The people who understand exactly what that means, and what the consequences are, are unlikely to let the system get near wraparound in the first place, and might even know enough to ignore the hint.
I'm thinking we might need to convey something about "writes". While it's less technically correct, I imagine it's more useful. Remember, many users have it drilled in their heads that they need to drop immediately to single-user mode. I'd like to give some idea of what they can and cannot do.
+ Previously it was required to stop the postmaster and VACUUM the database
+ in a single-user mode. There is no need to use a single-user mode anymore.
I think we need to go further and actively warn against it: It's slow, impossible to monitor, disables replication and disables safeguards against wraparound. (Other bad things too, but these are easily understandable for users)
Maybe mention also that it's main use in wraparound situations is for a way to perform DROPs and TRUNCATEs if that would help speed up resolution.
I propose for discussion that 0004 should show in the docs all the queries for finding prepared xacts, repl slots etc. If we ever show the info at runtime, we can dispense with the queries, but there seems to be no urgency for that...
--
John Naylor
EDB: http://www.enterprisedb.com
Okay, the changes look good. To go further, I think we need to combine into two patches, one with 0001-0003 and one with 0004:
1. Correct false statements about "shutdown" etc. This should contain changes that can safely be patched all the way to v11.
2. Change bad advice (single-user mode) into good advice. We can target head first, and then try to adopt as far back as we safely can (and test).
(...and future work, so not part of the CF here) 3. Tell the user what caused the problem, instead of saying "go figure it out yourself".
> > In swapping this topic back in my head, I also saw [2] where Robert suggested
> >
> > "that old prepared transactions and stale replication
> > slots should be emphasized more prominently. Maybe something like:
> >
> > HINT: Commit or roll back old prepared transactions, drop stale
> > replication slots, or kill long-running sessions.
> > Ensure that autovacuum is progressing, or run a manual database-wide VACUUM."
>
> It looks like the hint regarding replication slots was added at some
> point. Currently we have:
>
> ```
> errhint( [...]
> "You might also need to commit or roll back old prepared
> transactions, or drop stale replication slots.")));
> ```
Yes, the exact same text as it appeared in the [2] thread above, which prompted Robert's comment I quoted. I take the point to mean: All of these things need to be taken care of *first*, before vacuuming, so the hint should order things so that it is clear.
> Please let me know if you think
> we should also add a suggestion to kill long-running sessions, etc.
+1 for also adding that.
- errmsg("database is not accepting commands to avoid wraparound data loss in database \"%s\"",
+ errmsg("database is not accepting commands that generate new XIDs to avoid wraparound data loss in database \"%s\"",
I'm not quite on board with the new message, but welcome additional opinions. For one, it's a bit longer and now ambiguous. I also bet that "generate XIDs" doesn't really communicate anything useful. The people who understand exactly what that means, and what the consequences are, are unlikely to let the system get near wraparound in the first place, and might even know enough to ignore the hint.
I'm thinking we might need to convey something about "writes". While it's less technically correct, I imagine it's more useful. Remember, many users have it drilled in their heads that they need to drop immediately to single-user mode. I'd like to give some idea of what they can and cannot do.
+ Previously it was required to stop the postmaster and VACUUM the database
+ in a single-user mode. There is no need to use a single-user mode anymore.
I think we need to go further and actively warn against it: It's slow, impossible to monitor, disables replication and disables safeguards against wraparound. (Other bad things too, but these are easily understandable for users)
Maybe mention also that it's main use in wraparound situations is for a way to perform DROPs and TRUNCATEs if that would help speed up resolution.
I propose for discussion that 0004 should show in the docs all the queries for finding prepared xacts, repl slots etc. If we ever show the info at runtime, we can dispense with the queries, but there seems to be no urgency for that...
--
John Naylor
EDB: http://www.enterprisedb.com
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
From
Aleksander Alekseev
Date:
Hi John, Many thanks for all the great feedback! > Okay, the changes look good. To go further, I think we need to combine into two patches, one with 0001-0003 and one with0004: > > 1. Correct false statements about "shutdown" etc. This should contain changes that can safely be patched all the way tov11. > 2. Change bad advice (single-user mode) into good advice. We can target head first, and then try to adopt as far back aswe safely can (and test). Done. > > > In swapping this topic back in my head, I also saw [2] where Robert suggested > > > > > > "that old prepared transactions and stale replication > > > slots should be emphasized more prominently. Maybe something like: > > > > > > HINT: Commit or roll back old prepared transactions, drop stale > > > replication slots, or kill long-running sessions. > > > Ensure that autovacuum is progressing, or run a manual database-wide VACUUM." > > > > It looks like the hint regarding replication slots was added at some > > point. Currently we have: > > > > ``` > > errhint( [...] > > "You might also need to commit or roll back old prepared > > transactions, or drop stale replication slots."))); > > ``` > > Yes, the exact same text as it appeared in the [2] thread above, which prompted Robert's comment I quoted. I take the pointto mean: All of these things need to be taken care of *first*, before vacuuming, so the hint should order things sothat it is clear. > > > Please let me know if you think > > we should also add a suggestion to kill long-running sessions, etc. > > +1 for also adding that. OK, done. I included this change as a separate patch. It can be squashed with another one if necessary. While on it, I noticed that multixact.c still talks about a "shutdown". I made corresponding changes in 0001. > - errmsg("database is not accepting commands to avoid wraparound data loss in database \"%s\"", > + errmsg("database is not accepting commands that generate new XIDs to avoid wraparound data loss in database \"%s\"", > > I'm not quite on board with the new message, but welcome additional opinions. For one, it's a bit longer and now ambiguous.I also bet that "generate XIDs" doesn't really communicate anything useful. The people who understand exactlywhat that means, and what the consequences are, are unlikely to let the system get near wraparound in the first place,and might even know enough to ignore the hint. > > I'm thinking we might need to convey something about "writes". While it's less technically correct, I imagine it's moreuseful. Remember, many users have it drilled in their heads that they need to drop immediately to single-user mode. I'dlike to give some idea of what they can and cannot do. This particular wording was chosen for consistency with multixact.c: ``` errmsg("database is not accepting commands that generate new MultiXactIds to avoid wraparound data loss in database \"%s\"", ``` The idea of using "writes" is sort of OK, but note that the same message will appear for a query like: ``` SELECT pg_current_xact_id(); ``` ... which doesn't do writes. The message will be misleading in this case. On top of that, although a PostgreSQL user may not be aware of MultiXactIds, arguably there are many users that are aware of XIDs. Not to mention the fact that XIDs are well documented. I didn't make this change in v4 since it seems to be controversial and probably not the highest priority at the moment. I suggest we discuss it separately. > I propose for discussion that 0004 should show in the docs all the queries for finding prepared xacts, repl slots etc.If we ever show the info at runtime, we can dispense with the queries, but there seems to be no urgency for that... Good idea. > + Previously it was required to stop the postmaster and VACUUM the database > + in a single-user mode. There is no need to use a single-user mode anymore. > > I think we need to go further and actively warn against it: It's slow, impossible to monitor, disables replication anddisables safeguards against wraparound. (Other bad things too, but these are easily understandable for users) > > Maybe mention also that it's main use in wraparound situations is for a way to perform DROPs and TRUNCATEs if that wouldhelp speed up resolution. Fixed. -- Best regards, Aleksander Alekseev
Attachment
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
From
John Naylor
Date:
On Mon, Apr 3, 2023 at 7:33 PM Aleksander Alekseev <aleksander@timescale.com> wrote:
> > Yes, the exact same text as it appeared in the [2] thread above, which prompted Robert's comment I quoted. I take the point to mean: All of these things need to be taken care of *first*, before vacuuming, so the hint should order things so that it is clear.
> >
> > > Please let me know if you think
> > > we should also add a suggestion to kill long-running sessions, etc.
> >
> > +1 for also adding that.
>
> OK, done. I included this change as a separate patch. It can be
> squashed with another one if necessary.
Okay, great. For v4-0003:
Each hint mentions vacuum twice, because it's adding the vacuum message at the end, but not removing it from the beginning. The vacuum string should be on its own line, since we will have to modify that for back branches (skip indexes and truncation).
I'm also having second thoughts about "Ensure that autovacuum is progressing" in the hint. That might be better in the docs, if we decide to go ahead with adding a specific checklist there.
In vacuum.c:
errhint("Close open transactions soon to avoid wraparound problems.\n"
- "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
+ "You might also need to commit or roll back old prepared transactions, drop stale replication slots, or kill long-running sessions. Ensure that autovacuum is progressing, or run a manual database-wide VACUUM.")));
This appears in vacuum_get_cutoffs(), which is called by vacuum and cluster, and the open transactions were already mentioned, so this is not the place for this change.
> This particular wording was chosen for consistency with multixact.c:
>
> ```
> errmsg("database is not accepting commands that generate new
> MultiXactIds to avoid wraparound data loss in database \"%s\"",
> ```
Okay, I didn't look into that -- seems like a good precedent.
v4-0002:
- errhint("Stop the postmaster and vacuum that database in single-user mode.\n"
+ errhint("VACUUM that database.\n"
Further in the spirit of consistency, the mulixact path already has "Execute a database-wide VACUUM in that database.\n", and that seems like better wording.
--
John Naylor
EDB: http://www.enterprisedb.com
> > Yes, the exact same text as it appeared in the [2] thread above, which prompted Robert's comment I quoted. I take the point to mean: All of these things need to be taken care of *first*, before vacuuming, so the hint should order things so that it is clear.
> >
> > > Please let me know if you think
> > > we should also add a suggestion to kill long-running sessions, etc.
> >
> > +1 for also adding that.
>
> OK, done. I included this change as a separate patch. It can be
> squashed with another one if necessary.
Okay, great. For v4-0003:
Each hint mentions vacuum twice, because it's adding the vacuum message at the end, but not removing it from the beginning. The vacuum string should be on its own line, since we will have to modify that for back branches (skip indexes and truncation).
I'm also having second thoughts about "Ensure that autovacuum is progressing" in the hint. That might be better in the docs, if we decide to go ahead with adding a specific checklist there.
In vacuum.c:
errhint("Close open transactions soon to avoid wraparound problems.\n"
- "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
+ "You might also need to commit or roll back old prepared transactions, drop stale replication slots, or kill long-running sessions. Ensure that autovacuum is progressing, or run a manual database-wide VACUUM.")));
This appears in vacuum_get_cutoffs(), which is called by vacuum and cluster, and the open transactions were already mentioned, so this is not the place for this change.
> This particular wording was chosen for consistency with multixact.c:
>
> ```
> errmsg("database is not accepting commands that generate new
> MultiXactIds to avoid wraparound data loss in database \"%s\"",
> ```
Okay, I didn't look into that -- seems like a good precedent.
v4-0002:
- errhint("Stop the postmaster and vacuum that database in single-user mode.\n"
+ errhint("VACUUM that database.\n"
Further in the spirit of consistency, the mulixact path already has "Execute a database-wide VACUUM in that database.\n", and that seems like better wording.
--
John Naylor
EDB: http://www.enterprisedb.com
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
From
Pavel Borisov
Date:
Hi! I've looked into the patches v4. For 0001: I think long "not accepting commands that generate" is equivalent to more concise "can't generate". For 0003: I think double mentioning of Vacuum at each errhist i.e.: "Execute a database-wide VACUUM in that database" and "...or run a manual database-wide VACUUM." are redundant. The advice "Ensure that autovacuum is progressing,..." is also not needed after advice to "Execute a database-wide VACUUM in that database". For all: In a errhint's list what _might_ be done I think AND is a little bit better that OR as the word _might_ means that each of the proposals in the list is a probable, not a sure one. The proposed changes are in patchset v5. Kind regards, Pavel Borisov, Supabase.
Attachment
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
From
Aleksander Alekseev
Date:
Hi, > The proposed changes are in patchset v5. Pavel, John, thanks for your feedback. > I've looked into the patches v4. > For 0001: > I think long "not accepting commands that generate" is equivalent to > more concise "can't generate". Frankly I don't think this is a good change for this particular CF entry and it violates the consistency with multixact.c. Additionally the new message is not accurate. The DBMS _can_ generate new XIDs, quite a few of them actually. It merely refuses to do so. > For all: > In a errhint's list what _might_ be done I think AND is a little bit > better that OR as the word _might_ means that each of the proposals in > the list is a probable, not a sure one. Well, that's debatable... IMO "or" makes a bit more sense since the user knows better whether he/she needs to kill a long-running transaction, or run VACUUM, or maybe do both. "And" would imply that the user needs to do all of it, which is not necessarily true. Since originally it was "or" I suggest we keep it as is for now. All in all I would prefer keeping the focus on the particular problem the patch tries to address. > For 0003: > I think double mentioning of Vacuum at each errhist i.e.: "Execute a > database-wide VACUUM in that database" and "...or run a manual > database-wide VACUUM." are redundant. The advice "Ensure that > autovacuum is progressing,..." is also not needed after advice to > "Execute a database-wide VACUUM in that database". > [...] > Okay, great. For v4-0003: > > Each hint mentions vacuum twice, because it's adding the vacuum message at the end, but not removing it from the beginning.The vacuum string should be on its own line, since we will have to modify that for back branches (skip indexesand truncation). My bad. Fixed. > I'm also having second thoughts about "Ensure that autovacuum is progressing" in the hint. That might be better in thedocs, if we decide to go ahead with adding a specific checklist there. OK, done. > In vacuum.c: > > errhint("Close open transactions soon to avoid wraparound problems.\n" > - "You might also need to commit or roll back old prepared transactions, or drop stale replication slots."))); > + "You might also need to commit or roll back old prepared transactions, drop stale replication slots, or kill long-runningsessions. Ensure that autovacuum is progressing, or run a manual database-wide VACUUM."))); > > This appears in vacuum_get_cutoffs(), which is called by vacuum and cluster, and the open transactions were already mentioned,so this is not the place for this change. Fixed. > v4-0002: > > - errhint("Stop the postmaster and vacuum that database in single-user mode.\n" > + errhint("VACUUM that database.\n" > > Further in the spirit of consistency, the mulixact path already has "Execute a database-wide VACUUM in that database.\n",and that seems like better wording. Agree. Fixed. -- Best regards, Aleksander Alekseev
Attachment
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
From
Pavel Borisov
Date:
On Tue, 4 Apr 2023 at 17:08, Aleksander Alekseev <aleksander@timescale.com> wrote: > > Hi, > > > The proposed changes are in patchset v5. > > Pavel, John, thanks for your feedback. > > > I've looked into the patches v4. > > For 0001: > > I think long "not accepting commands that generate" is equivalent to > > more concise "can't generate". > > Frankly I don't think this is a good change for this particular CF > entry and it violates the consistency with multixact.c. Additionally > the new message is not accurate. The DBMS _can_ generate new XIDs, > quite a few of them actually. It merely refuses to do so. > > > For all: > > In a errhint's list what _might_ be done I think AND is a little bit > > better that OR as the word _might_ means that each of the proposals in > > the list is a probable, not a sure one. > > Well, that's debatable... IMO "or" makes a bit more sense since the > user knows better whether he/she needs to kill a long-running > transaction, or run VACUUM, or maybe do both. "And" would imply that > the user needs to do all of it, which is not necessarily true. Since > originally it was "or" I suggest we keep it as is for now. > > All in all I would prefer keeping the focus on the particular problem > the patch tries to address. > > > For 0003: > > I think double mentioning of Vacuum at each errhist i.e.: "Execute a > > database-wide VACUUM in that database" and "...or run a manual > > database-wide VACUUM." are redundant. The advice "Ensure that > > autovacuum is progressing,..." is also not needed after advice to > > "Execute a database-wide VACUUM in that database". > > [...] > > > Okay, great. For v4-0003: > > > > Each hint mentions vacuum twice, because it's adding the vacuum message at the end, but not removing it from the beginning.The vacuum string should be on its own line, since we will have to modify that for back branches (skip indexesand truncation). > > My bad. Fixed. > > > I'm also having second thoughts about "Ensure that autovacuum is progressing" in the hint. That might be better in thedocs, if we decide to go ahead with adding a specific checklist there. > > OK, done. > > > In vacuum.c: > > > > errhint("Close open transactions soon to avoid wraparound problems.\n" > > - "You might also need to commit or roll back old prepared transactions, or drop stale replication slots."))); > > + "You might also need to commit or roll back old prepared transactions, drop stale replication slots, or kill long-runningsessions. Ensure that autovacuum is progressing, or run a manual database-wide VACUUM."))); > > > > This appears in vacuum_get_cutoffs(), which is called by vacuum and cluster, and the open transactions were already mentioned,so this is not the place for this change. > > Fixed. > > > v4-0002: > > > > - errhint("Stop the postmaster and vacuum that database in single-user mode.\n" > > + errhint("VACUUM that database.\n" > > > > Further in the spirit of consistency, the mulixact path already has "Execute a database-wide VACUUM in that database.\n",and that seems like better wording. > > Agree. Fixed. Alexander, Ok, nice! I think it could be moved to committer then. Pavel.
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
From
John Naylor
Date:
On Tue, Apr 4, 2023 at 8:08 PM Aleksander Alekseev <aleksander@timescale.com> wrote:
> [v6]
0001:
Looks good to me. I've just made some small edits for v7 and wrote a commit message to explain how we got here. This can be backpatched all the way, as it's simply a correction. I do want to test on v11 first just for completeness. (The reality has already been tested by others back to 9.6 but there's no substitute for trying it yourself). I hope to commit soon after that.
0002:
I've been testing wraparound using the v3 convenience function in [1] to reach xidStopLimit:
-- reduce log spam
alter system set log_min_messages = error;
alter system set client_min_messages = error;
-- restart
-- no actual replication, just for testing dropping it
select pg_create_physical_replication_slot('foo', true, false);
create table t (i int);
BEGIN;
insert into t values(1);
PREPARE TRANSACTION 'trx_id_pin';
-- get to xidStopLimit
select consume_xids(1*1000*1000*1000);
insert into t values(1);
select consume_xids(1*1000*1000*1000);
insert into t values(1);
select consume_xids( 140*1000*1000);
insert into t values(1);
select consume_xids( 10*1000*1000);
SELECT datname, age(datfrozenxid) FROM pg_database;
-- works just fine
select pg_drop_replication_slot('foo');
COMMIT PREPARED 'trx_id_pin';
-- watch autovacuum take care of it automatically:
SELECT datname, age(datfrozenxid) FROM pg_database;
The consume_xids function builds easily on PG14, but before that it would need a bit of work because data types changed. That coincidentally was the first version to include the failsafe, which is convenient in this scenario. I'd like to do testing on PG12/13 before commit, which would require turning off truncation in the command (and can also be made faster by turning off index cleanup), but I'm also okay with going ahead with just going back to PG14 at first. That also safest.
I made some small changes and wrote a suitably comprehensive commit message. I separated the possible uses for single-user mode into a separate paragraph in the "Note:" , moved the justification for the 3-million xid margin there, and restored the link to how to run it (I already mentioned we still need this info, but didn't notice this part didn't make it back in).
0003:
It really needs a more comprehensive change, and just making a long hint even longer doesn't seem worth doing. I'd like to set that aside and come back to it. I've left it out of the attached set.
[1] https://www.postgresql.org/message-id/CAD21AoBZ3t%2BfRtVamQTA%2BwBJaapFUY1dfP08-rxsQ%2BfouPvgKg%40mail.gmail.com
--
John Naylor
EDB: http://www.enterprisedb.com
Attachment
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
From
Peter Geoghegan
Date:
On Sat, Apr 29, 2023 at 1:09 AM John Naylor <john.naylor@enterprisedb.com> wrote: > Looks good to me. I'm strongly in favor of this. It's most unfortunate that it took this long. > I've just made some small edits for v7 and wrote a commit message to explain how we got here. This can be backpatched allthe way, as it's simply a correction. +1 to backpatching at least back until v14. After all, it wouldn't make any sense for us to not backpatch to 14; the whole justification for doing this was in no way influenced by anything that happened since the failsafe went in. I'm also in favor of backpatching to 11, 12, and 13 -- though I acknowledge that that's more of a judgement call. As you note in comments in the draft patch, the story with these earlier releases (especially 11) is a little more complicated for users. > I made some small changes and wrote a suitably comprehensive commit message. I separated the possible uses for single-usermode into a separate paragraph in the "Note:" , moved the justification for the 3-million xid margin there, andrestored the link to how to run it (I already mentioned we still need this info, but didn't notice this part didn't makeit back in). I notice that you've called xidStopLimit "read-only mode" in the docs. I think that it makes sense that you wouldn't use the term xidStopLimit here, but I'm not sure about this terminology, either. It seems to me that it should be something quite specific, since we're talking about a very specific mechanism. Whatever it is, It shouldn't contain the word "wraparound". Separately, there is a need to update a couple of other places to use this new terminology. The documentation for vacuum_sailsafe_age and vacuum_multixact_failsafe_age refer to "system-wide transaction ID wraparound failure", which seems less than ideal, even without your patch. Do we need two new names? One for xidStopLimit, another for multiStopLimit? The latter really can't be called read-only mode. > 0003: > > It really needs a more comprehensive change, and just making a long hint even longer doesn't seem worth doing. I'd liketo set that aside and come back to it. I've left it out of the attached set. Yeah, 0003 can be treated as independent work IMV. -- Peter Geoghegan
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
From
John Naylor
Date:
On Sun, Apr 30, 2023 at 4:15 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Sat, Apr 29, 2023 at 1:09 AM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> > I made some small changes and wrote a suitably comprehensive commit message. I separated the possible uses for single-user mode into a separate paragraph in the "Note:" , moved the justification for the 3-million xid margin there, and restored the link to how to run it (I already mentioned we still need this info, but didn't notice this part didn't make it back in).
>
> I notice that you've called xidStopLimit "read-only mode" in the docs.
> I think that it makes sense that you wouldn't use the term
> xidStopLimit here, but I'm not sure about this terminology, either. It
> seems to me that it should be something quite specific, since we're
> talking about a very specific mechanism. Whatever it is, It shouldn't
> contain the word "wraparound".
How about
-HINT: To avoid a database shutdown, [...]
+HINT: To prevent XID exhaustion, [...]
...and "MXID", respectively? We could explain in the docs that vacuum and read-only queries still work "when XIDs have been exhausted", etc.
(I should probably also add in the commit message that the "shutdown" in the message was carried over to MXIDs when they arrived also in 2005).
> Separately, there is a need to update a couple of other places to use
> this new terminology. The documentation for vacuum_sailsafe_age and
> vacuum_multixact_failsafe_age refer to "system-wide transaction ID
> wraparound failure", which seems less than ideal, even without your
> patch.
Right, I'll have a look.
> Do we need two new names? One for xidStopLimit, another for
> multiStopLimit? The latter really can't be called read-only mode.
Thanks for that correction.
Somewhat related to the now-postponed 0003: I think the docs would do well to have ordered steps for recovering from both XID and MXID exhaustion. The previous practice of shutting down had the side-effect of e.g. rolling back all in-progress transactions whose XID appeared in a MXID but if you remain in normal mode there is a bit more to check. Manual VACUUM will warn about "cutoff for removing and freezing tuples is far in the past", but the docs should be clear on this.
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
From
Peter Geoghegan
Date:
On Sat, Apr 29, 2023 at 7:30 PM John Naylor <john.naylor@enterprisedb.com> wrote: > How about > > -HINT: To avoid a database shutdown, [...] > +HINT: To prevent XID exhaustion, [...] > > ...and "MXID", respectively? We could explain in the docs that vacuum and read-only queries still work "when XIDs havebeen exhausted", etc. I think that that particular wording works in this example -- we *are* avoiding XID exhaustion. But it still doesn't really address my concern -- at least not on its own. I think that we need a term for xidStopLimit mode (and perhaps multiStopLimit) itself. This is a discrete state/mode that is associated with a specific mechanism. I'd like to emphasize the purpose of xidStopLimit (over when xidStopLimit happens) in choosing this user-facing name. As you know, the point of xidStopLimit mode is to give autovacuum the chance to catch up with managing the XID space through freezing: the available supply of XIDs doesn't meet present demand, and hasn't for some time, so it finally came to this. Even if we had true 64-bit XIDs we'd probably still need something similar -- there would still have to be *some* point that allowing the "freezing deficit" to continue to grow just wasn't tenable. If a person consistently spends more than they take in, their "initial bankroll" isn't necessarily relevant. If our ~2.1 billion XID "bankroll" wasn't enough to avoid xidStopLimit, why would we expect 8 billion or 20 billion XIDs to have been enough? I'm thinking of a user-facing name for xidStopLimit along the lines of "emergency XID allocation restoration mode" (admittedly that's quite a mouthful). Something that carries the implication of "imbalance". The system was configured in a way that turned out to be unsustainable. The system was therefore forced to "restore sustainability" using the only tool that remained. This is closely related to the failsafe. As bad as xidStopLimit is, it won't always be the end of the world -- much depends on individual application requirements. > (I should probably also add in the commit message that the "shutdown" in the message was carried over to MXIDs when theyarrived also in 2005). > > > Separately, there is a need to update a couple of other places to use > > this new terminology. The documentation for vacuum_sailsafe_age and > > vacuum_multixact_failsafe_age refer to "system-wide transaction ID > > wraparound failure", which seems less than ideal, even without your > > patch. > > Right, I'll have a look. As you know, there is a more general problem with the use of the term "wraparound" in the docs, and in the system itself (in places like pg_stat_activity). Even the very basic terminology in this area is needlessly scary. Terms like "VACUUM (to prevent wraparound)" are uncomfortably close to "a race against time to avoid data corruption". The system isn't ever supposed to corrupt data, even if misconfigured (unless the misconfiguration is very low-level, such as "fsync=off"). Users should be able to take that much for granted. I don't expect either of us to address that problem in the short term -- the term "wraparound" is too baked-in for it to be okay to just remove it overnight. But, it could still make sense for your patch (or my own) to fully own the fact that "wraparound" is actually a misnomer. At least when used in contexts like "to prevent wraparound" (xidStopLimit actually "prevents wraparound", though we shouldn't say anything about it in a place of prominence). Let's reassure users that they should continue to take "we won't corrupt your data for no good reason" for granted. > I think the docs would do well to have ordered steps for recovering from both XID and MXID exhaustion. I had planned to address this with my ongoing work on the "Routine Vacuuming" docs, but I think that you're right about the necessity of addressing it as part of this patch. These extra steps will be required whenever the problem is a leaked prepared transaction, or something along those lines. That is increasingly likely to turn out to be the underlying cause of entering xidStopLimit, given the work we've done on VACUUM over the years. I still think that "imbalance" is the right way to frame discussion of xidStopLimit. After all, autovacuum/VACUUM will still spin its wheels in a futile effort to "restore balance". So it's kinda still about restoring imbalance IMV. -- Peter Geoghegan
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
From
John Naylor
Date:
On Mon, May 1, 2023 at 2:30 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Sat, Apr 29, 2023 at 7:30 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> > How about
> >
> > -HINT: To avoid a database shutdown, [...]
> > +HINT: To prevent XID exhaustion, [...]
> >
> > ...and "MXID", respectively? We could explain in the docs that vacuum and read-only queries still work "when XIDs have been exhausted", etc.
>
> I think that that particular wording works in this example -- we *are*
> avoiding XID exhaustion. But it still doesn't really address my
> concern -- at least not on its own. I think that we need a term for
> xidStopLimit mode (and perhaps multiStopLimit) itself. This is a
> discrete state/mode that is associated with a specific mechanism.
Well, since you have a placeholder "xidStopLimit mode" in your other patch, I'll confine my response to there. Inventing "modes" seems like an awkward thing to backpatch, not to mention moving the goalposts. My modest goal here is quite limited: to stop lying to our users about "not accepting commands", and change tragically awful advice into sensible advice.
Here's my new idea:
-HINT: To avoid a database shutdown, [...]
+HINT: To prevent XID generation failure, [...]
Actually, I like "allocation" better, but the v8 patch now has "generation" simply because one MXID message already has "generate" and I did it that way before thinking too hard. I'd be okay with either one as long as it's consistent.
> > (I should probably also add in the commit message that the "shutdown" in the message was carried over to MXIDs when they arrived also in 2005).
Done
> > > Separately, there is a need to update a couple of other places to use
> > > this new terminology. The documentation for vacuum_sailsafe_age and
> > > vacuum_multixact_failsafe_age refer to "system-wide transaction ID
> > > wraparound failure", which seems less than ideal, even without your
> > > patch.
> >
> > Right, I'll have a look.
Looking now, I'm even less inclined to invent new terminology in back branches.
> As you know, there is a more general problem with the use of the term
> "wraparound" in the docs, and in the system itself (in places like
> pg_stat_activity). Even the very basic terminology in this area is
> needlessly scary. Terms like "VACUUM (to prevent wraparound)" are
> uncomfortably close to "a race against time to avoid data corruption".
> The system isn't ever supposed to corrupt data, even if misconfigured
> (unless the misconfiguration is very low-level, such as "fsync=off").
> Users should be able to take that much for granted.
Granted. Whatever form your rewrite ends up in, it could make a lot of sense to then backpatch a few localized corrections. I wouldn't even object to including a few substitutions of s/wraparound failure/allocation failure/ where appropriate. Let's see how that shakes out first.
> > I think the docs would do well to have ordered steps for recovering from both XID and MXID exhaustion.
>
> I had planned to address this with my ongoing work on the "Routine
> Vacuuming" docs, but I think that you're right about the necessity of
> addressing it as part of this patch.
0003 is now a quick-and-dirty attempt at that, only in the docs. The MXID part is mostly copy-pasted from the XID part, just to get something together. I'd like to abbreviate that somehow.
--
John Naylor
EDB: http://www.enterprisedb.com
>
> On Sat, Apr 29, 2023 at 7:30 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> > How about
> >
> > -HINT: To avoid a database shutdown, [...]
> > +HINT: To prevent XID exhaustion, [...]
> >
> > ...and "MXID", respectively? We could explain in the docs that vacuum and read-only queries still work "when XIDs have been exhausted", etc.
>
> I think that that particular wording works in this example -- we *are*
> avoiding XID exhaustion. But it still doesn't really address my
> concern -- at least not on its own. I think that we need a term for
> xidStopLimit mode (and perhaps multiStopLimit) itself. This is a
> discrete state/mode that is associated with a specific mechanism.
Well, since you have a placeholder "xidStopLimit mode" in your other patch, I'll confine my response to there. Inventing "modes" seems like an awkward thing to backpatch, not to mention moving the goalposts. My modest goal here is quite limited: to stop lying to our users about "not accepting commands", and change tragically awful advice into sensible advice.
Here's my new idea:
-HINT: To avoid a database shutdown, [...]
+HINT: To prevent XID generation failure, [...]
Actually, I like "allocation" better, but the v8 patch now has "generation" simply because one MXID message already has "generate" and I did it that way before thinking too hard. I'd be okay with either one as long as it's consistent.
> > (I should probably also add in the commit message that the "shutdown" in the message was carried over to MXIDs when they arrived also in 2005).
Done
> > > Separately, there is a need to update a couple of other places to use
> > > this new terminology. The documentation for vacuum_sailsafe_age and
> > > vacuum_multixact_failsafe_age refer to "system-wide transaction ID
> > > wraparound failure", which seems less than ideal, even without your
> > > patch.
> >
> > Right, I'll have a look.
Looking now, I'm even less inclined to invent new terminology in back branches.
> As you know, there is a more general problem with the use of the term
> "wraparound" in the docs, and in the system itself (in places like
> pg_stat_activity). Even the very basic terminology in this area is
> needlessly scary. Terms like "VACUUM (to prevent wraparound)" are
> uncomfortably close to "a race against time to avoid data corruption".
> The system isn't ever supposed to corrupt data, even if misconfigured
> (unless the misconfiguration is very low-level, such as "fsync=off").
> Users should be able to take that much for granted.
Granted. Whatever form your rewrite ends up in, it could make a lot of sense to then backpatch a few localized corrections. I wouldn't even object to including a few substitutions of s/wraparound failure/allocation failure/ where appropriate. Let's see how that shakes out first.
> > I think the docs would do well to have ordered steps for recovering from both XID and MXID exhaustion.
>
> I had planned to address this with my ongoing work on the "Routine
> Vacuuming" docs, but I think that you're right about the necessity of
> addressing it as part of this patch.
0003 is now a quick-and-dirty attempt at that, only in the docs. The MXID part is mostly copy-pasted from the XID part, just to get something together. I'd like to abbreviate that somehow.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachment
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
From
Peter Geoghegan
Date:
On Mon, May 1, 2023 at 5:34 AM John Naylor <john.naylor@enterprisedb.com> wrote: > Well, since you have a placeholder "xidStopLimit mode" in your other patch, I'll confine my response to there. Inventing"modes" seems like an awkward thing to backpatch, not to mention moving the goalposts. My modest goal here is quitelimited: to stop lying to our users about "not accepting commands", and change tragically awful advice into sensibleadvice. I can't argue with that. > Here's my new idea: > > -HINT: To avoid a database shutdown, [...] > +HINT: To prevent XID generation failure, [...] > > Actually, I like "allocation" better, but the v8 patch now has "generation" simply because one MXID message already has"generate" and I did it that way before thinking too hard. I'd be okay with either one as long as it's consistent. WFM. > Granted. Whatever form your rewrite ends up in, it could make a lot of sense to then backpatch a few localized corrections.I wouldn't even object to including a few substitutions of s/wraparound failure/allocation failure/ where appropriate.Let's see how that shakes out first. Makes sense. > > > I think the docs would do well to have ordered steps for recovering from both XID and MXID exhaustion. > > > > I had planned to address this with my ongoing work on the "Routine > > Vacuuming" docs, but I think that you're right about the necessity of > > addressing it as part of this patch. > > 0003 is now a quick-and-dirty attempt at that, only in the docs. The MXID part is mostly copy-pasted from the XID part,just to get something together. I'd like to abbreviate that somehow. Yeah, the need to abbreviate statements about MultiXact IDs by saying that they work analogously to XIDs in some particular respect is...another thing that makes this tricky. I don't think that Multis are fundamentally different to XIDs. I believe that the process through which VACUUM establishes its OldestMXact cutoff can be pessimistic compared to OldestXmin, but I don't think that it changes the guidance you'll need to give here. VACUUM should always be able to advance relminmxid right up until OldestMXact, if that's what the user insists on. For example, VACUUM FREEZE sometimes allocates new Multis, just to be able to do that. Obviously there are certain things that can hold back OldestMXact by a wildly excessive amount. But I don't think that there is anything that can hold back OldestMXact by a wildly excessive amount that won't more or less do the same thing to OldestXmin. -- Peter Geoghegan
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
From
Peter Geoghegan
Date:
On Mon, May 1, 2023 at 7:55 PM Peter Geoghegan <pg@bowt.ie> wrote: > Obviously there are certain things that can hold back OldestMXact by a > wildly excessive amount. But I don't think that there is anything that > can hold back OldestMXact by a wildly excessive amount that won't more > or less do the same thing to OldestXmin. Actually, it's probably possible for a transaction that only has a virtual transaction ID to call MultiXactIdSetOldestVisible(), which will then have the effect of holding back OldestMXact without also holding back OldestXmin (in READ COMMITTED mode). Will have to check to make sure, but that won't happen today. -- Peter Geoghegan
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
From
John Naylor
Date:
On Tue, May 2, 2023 at 9:55 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, May 1, 2023 at 5:34 AM John Naylor <john.naylor@enterprisedb.com> wrote:
> > 0003 is now a quick-and-dirty attempt at that, only in the docs. The MXID part is mostly copy-pasted from the XID part, just to get something together. I'd like to abbreviate that somehow.
>
> Yeah, the need to abbreviate statements about MultiXact IDs by saying
> that they work analogously to XIDs in some particular respect
> is...another thing that makes this tricky.
Then it sounds like they should stay separate. A direct copy-paste is not good for style, so I will add things like:
- If for some reason autovacuum fails to clear old MXIDs from a table, the
+ As in the case with XIDs, it is possible for autovacuum to fail to [...]
It might least be good for readability to gloss over the warning and only quote the MXID limit error message, but we'll have to see how it looks.
--
John Naylor
EDB: http://www.enterprisedb.com
>
> On Mon, May 1, 2023 at 5:34 AM John Naylor <john.naylor@enterprisedb.com> wrote:
> > 0003 is now a quick-and-dirty attempt at that, only in the docs. The MXID part is mostly copy-pasted from the XID part, just to get something together. I'd like to abbreviate that somehow.
>
> Yeah, the need to abbreviate statements about MultiXact IDs by saying
> that they work analogously to XIDs in some particular respect
> is...another thing that makes this tricky.
Then it sounds like they should stay separate. A direct copy-paste is not good for style, so I will add things like:
- If for some reason autovacuum fails to clear old MXIDs from a table, the
+ As in the case with XIDs, it is possible for autovacuum to fail to [...]
It might least be good for readability to gloss over the warning and only quote the MXID limit error message, but we'll have to see how it looks.
--
John Naylor
EDB: http://www.enterprisedb.com
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
From
Peter Geoghegan
Date:
On Tue, May 2, 2023 at 6:46 PM John Naylor <john.naylor@enterprisedb.com> wrote: > It might least be good for readability to gloss over the warning and only quote the MXID limit error message, but we'llhave to see how it looks. Apparently you expect me to join you in pretending that you didn't lambast my review on this thread less than 24 hours ago [1]. I happen to believe that this particular patch is of great strategic importance, so I'll admit that I thought about it for a second. But just a second -- I have more self-respect than that. That's not the only reason, though. I also genuinely don't have the foggiest notion what was behind what you said. In particular, this part still makes zero sense to me: "Claim that others are holding you back, and then try to move the goalposts in their work" Let me get this straight: "Moving the goalposts of their work" refers to something *I* did to *you*, on *this* thread...right? If anything, I'm biased in *favor* of this patch. The fact that you seem to think that I was being obstructionist just doesn't make any sense to me, at all. I really don't know where to go from there. I'm not so much upset as baffled. [1] https://postgr.es/m/CAFBsxsGJMp43QO2cLAh0==ueYVL35pbbEHeXZ0cnZkU=q8sFkg@mail.gmail.com -- Peter Geoghegan
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
From
John Naylor
Date:
On Wed, May 3, 2023 at 10:04 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> That's not the only reason, though. I also genuinely don't have the
> foggiest notion what was behind what you said. In particular, this
> part still makes zero sense to me:
>
> "Claim that others are holding you back, and then try to move the
> goalposts in their work"
I went to go find the phrase that I thought I was reacted to, and ... nothing. I am also baffled. My comment was inexcusable.
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
From
Peter Geoghegan
Date:
On Wed, May 3, 2023 at 12:30 AM John Naylor <john.naylor@enterprisedb.com> wrote: > I went to go find the phrase that I thought I was reacted to, and ... nothing. I am also baffled. My comment was inexcusable. I'd quite like to drop this topic, and get on with the work at hand. But before I do that, I ask you to consider one thing: if you were mistaken about my words (or their intent) on this occasion, isn't it also possible that it wasn't the first time? I never had the opportunity to sit down to talk with you face to face before now. If things had been different (if we managed to talk at one of the PGCons before COVID, say), then maybe this incident would have happened in just the same way. I can't help but think that some face time would have prevented the whole episode, though. You have every right to dislike me on a personal level, of course, but if you do then I'd very much prefer that it be due to one of my actual flaws. I'm not a petty man -- I don't resent the success of others. I've always thought that you do rather good work. Plus I'm just not in the habit of obstructing things that I directly benefit from. -- Peter Geoghegan
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
From
John Naylor
Date:
On Thu, May 4, 2023 at 12:59 AM Peter Geoghegan <pg@bowt.ie> wrote:
> I'd quite like to drop this topic, and get on with the work at hand.
I'd be grateful, and the other points you made are, of course, valid.
--
John Naylor
EDB: http://www.enterprisedb.com
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
From
John Naylor
Date:
Attached is v9, which is mostly editing the steps for restoring normal operation, which are in 0003 now but will be squashed into 0002. Comments to polish the wording welcome.
Attachment
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
From
Peter Geoghegan
Date:
On Fri, May 12, 2023 at 9:14 PM John Naylor <john.naylor@enterprisedb.com> wrote: > Attached is v9, which is mostly editing the steps for restoring normal operation, which are in 0003 now but will be squashedinto 0002. Comments to polish the wording welcome. I'll try to get you more feedback on this soon. BTW, Google cloud already just instruct their users to ignore the xidStopLimit HINT about single user mode: https://cloud.google.com/sql/docs/postgres/txid-wraparound I checked with archive.org. This directive to just ignore the HINT appeared for the first time no later than December 2021. Fixing this in Postgres is long overdue. -- Peter Geoghegan
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
From
Peter Geoghegan
Date:
On Sun, May 14, 2023 at 6:06 PM Peter Geoghegan <pg@bowt.ie> wrote: > BTW, Google cloud already just instruct their users to ignore the > xidStopLimit HINT about single user mode: > > https://cloud.google.com/sql/docs/postgres/txid-wraparound I read this just today, and was reminded of this thread: https://cloud.google.com/blog/products/databases/alloydb-for-postgresql-under-the-hood-adaptive-autovacuum It reads: "1. Transaction ID wraparound: PostgreSQL transaction IDs or XIDs are 32-bit unsigned integers that are assigned to each transaction and also get incremented. When they reach their maximum value, it would wrap around to zero (similar to a ring buffer) and can lead to data corruption." -- Peter Geoghegan
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
From
Peter Eisentraut
Date:
On 20.09.23 05:41, Peter Geoghegan wrote: > On Sun, May 14, 2023 at 6:06 PM Peter Geoghegan <pg@bowt.ie> wrote: >> BTW, Google cloud already just instruct their users to ignore the >> xidStopLimit HINT about single user mode: >> >> https://cloud.google.com/sql/docs/postgres/txid-wraparound > > I read this just today, and was reminded of this thread: > > https://cloud.google.com/blog/products/databases/alloydb-for-postgresql-under-the-hood-adaptive-autovacuum > > It reads: > > "1. Transaction ID wraparound: PostgreSQL transaction IDs or XIDs are > 32-bit unsigned integers that are assigned to each transaction and > also get incremented. When they reach their maximum value, it would > wrap around to zero (similar to a ring buffer) and can lead to data > corruption." What is the status of this patch discussion now? It had been set as Ready for Committer for some months. Do these recent discussions invalidate that? Does it need more discussion?
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
From
Peter Geoghegan
Date:
On Sun, Oct 1, 2023 at 11:46 AM Peter Eisentraut <peter@eisentraut.org> wrote: > What is the status of this patch discussion now? It had been set as > Ready for Committer for some months. Do these recent discussions > invalidate that? Does it need more discussion? I don't think that recent discussion invalidated anything. I meant to follow-up on investigating the extent to which anything could hold up OldestMXact without also holding up OldestXmin/removable cutoff, but that doesn't seem essential. This patch does indeed seem "ready for committer". John? -- Peter Geoghegan
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
From
Pavel Borisov
Date:
Hi! On Mon, 2 Oct 2023 at 03:34, Peter Geoghegan <pg@bowt.ie> wrote: > > On Sun, Oct 1, 2023 at 11:46 AM Peter Eisentraut <peter@eisentraut.org> wrote: > > What is the status of this patch discussion now? It had been set as > > Ready for Committer for some months. Do these recent discussions > > invalidate that? Does it need more discussion? > > I don't think that recent discussion invalidated anything. I meant to > follow-up on investigating the extent to which anything could hold up > OldestMXact without also holding up OldestXmin/removable cutoff, but > that doesn't seem essential. > > This patch does indeed seem "ready for committer". John? > > -- > Peter Geoghegan FWIW I think the patch is still in good shape and worth to be committed. Regards, Pavel Borisov Supabase
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
From
Robert Haas
Date:
On Mon, Oct 2, 2023 at 11:52 AM Pavel Borisov <pashkin.elfe@gmail.com> wrote: > FWIW I think the patch is still in good shape and worth to be committed. I'm also pretty happy with these patches and would like to see at least 0001 and 0002 committed, and probably 0003 as well. I am, however, -1 on back-patching. Perhaps that is overly cautious, but I don't like changing existing messages in back-branches. It will break translations, and potentially monitoring scripts, etc. If John's not available to take this forward, I can volunteer as substitute committer, unless Peter or Peter would like to handle it. -- Robert Haas EDB: http://www.enterprisedb.com
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
From
Peter Geoghegan
Date:
On Mon, Oct 2, 2023 at 1:25 PM Robert Haas <robertmhaas@gmail.com> wrote: > I'm also pretty happy with these patches and would like to see at > least 0001 and 0002 committed, and probably 0003 as well. I am, > however, -1 on back-patching. Perhaps that is overly cautious, but I > don't like changing existing messages in back-branches. It will break > translations, and potentially monitoring scripts, etc. > > If John's not available to take this forward, I can volunteer as > substitute committer, unless Peter or Peter would like to handle it. If you're willing to take over as committer here, I'll let the issue of backpatching go. I only ask that you note why you've not backpatched in the commit message. -- Peter Geoghegan
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
From
Robert Haas
Date:
On Wed, Oct 4, 2023 at 8:07 AM Peter Geoghegan <pg@bowt.ie> wrote: > If you're willing to take over as committer here, I'll let the issue > of backpatching go. > > I only ask that you note why you've not backpatched in the commit message. Will do, but see also the last point below. I have looked over these patches in some detail and here are my thoughts: - I find the use of the word "generate" in error messages slightly odd. I think it's reasonable given the existing precedent, but the word I would have picked is "assign", which I see is what Aleksander actually had in v1. How would people feel about changing the two existing messages that say "database is not accepting commands that generate new MultiXactIds to avoid wraparound data loss ..." to use "assign" instead, and then make the new messages match that? - I think that 0002 needs a bit of wordsmithing. I will work on that. In particular, I don't like this sentence: "It increases downtime, makes monitoring impossible, disables replication, bypasses safeguards against wraparound, etc." While there's nothing untrue there, it feels more like a sentence from a pgsql-hackers email where most people participating in the discussion understand the general contours of the problem already than like polished documentation that really lays things out methodically. - I'm somewhat inclined to have a go at restructuring these patches a bit so that some of the documentation changes can potentially be back-patched without back-patching the message changes. Even if we eventually decide to back-patch everything or nothing, there are wording adjustments spread across all 3 patches that seem somewhat independent of the changes to the server messages. I think it would be clearer to have one patch that is mostly about documentation wording changes, and a second one that is about changing the server messages and then making documentation changes that are directly dependent on those message changes. And I might also be inclined to back-patch the former patch as far as it makes sense to do so, while leaving the latter one master-only. Comments? -- Robert Haas EDB: http://www.enterprisedb.com
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
From
Peter Geoghegan
Date:
On Thu, Oct 12, 2023 at 8:54 AM Robert Haas <robertmhaas@gmail.com> wrote: > - I find the use of the word "generate" in error messages slightly > odd. I think it's reasonable given the existing precedent, but the > word I would have picked is "assign", which I see is what Aleksander > actually had in v1. How would people feel about changing the two > existing messages that say "database is not accepting commands that > generate new MultiXactIds to avoid wraparound data loss ..." to use > "assign" instead, and then make the new messages match that? WFM. > - I think that 0002 needs a bit of wordsmithing. I will work on that. > In particular, I don't like this sentence: "It increases downtime, > makes monitoring impossible, disables replication, bypasses safeguards > against wraparound, etc." While there's nothing untrue there, it feels > more like a sentence from a pgsql-hackers email where most people > participating in the discussion understand the general contours of the > problem already than like polished documentation that really lays > things out methodically. I agree. > - I'm somewhat inclined to have a go at restructuring these patches a > bit so that some of the documentation changes can potentially be > back-patched without back-patching the message changes. Even if we > eventually decide to back-patch everything or nothing, there are > wording adjustments spread across all 3 patches that seem somewhat > independent of the changes to the server messages. I think it would be > clearer to have one patch that is mostly about documentation wording > changes, and a second one that is about changing the server messages > and then making documentation changes that are directly dependent on > those message changes. And I might also be inclined to back-patch the > former patch as far as it makes sense to do so, while leaving the > latter one master-only. No objections from me. -- Peter Geoghegan
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
From
Robert Haas
Date:
On Thu, Oct 12, 2023 at 12:01 PM Peter Geoghegan <pg@bowt.ie> wrote: > No objections from me. Here is a doc-only patch that I think could be back-patched as far as emergency mode exists. It combines all of the wording changes to the documentation from v1-v3 of the previous version, but without changing the message text that is quoted in the documentation, and without adding more instances of similar message texts to the documentation, and with a bunch of additional hacking by me. Some things I changed: - I made it so that the MXID section refers back to the XID section instead of duplicating it, but with a short list of differences. - I weakened the existing claim that says you must be a superuser or VACUUM definitely won't fix it to say instead that you SHOULD run VACUUM as the superuser, because the former is false and the latter is true. - I made the list of steps for recovering more explicit. - I split out the bit about running autovacuum in the affected database into a separate step to be performed after VACUUM for continued good operation, rather than a necessary ingredient in recovery, because it isn't. - A bit of other minor rejiggering. I'm not forgetting about the rest of the proposed patch set, or the change I proposed earlier. I'm just posting this much now because this is how far I got today, and it would be useful to get comments before I go further. I think the residual portion of the patch set not included in this documentation patch will be quite small, and I think that's a good thing, but again, I don't intend to blow that off. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
From
Peter Geoghegan
Date:
On Thu, Oct 12, 2023 at 1:10 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Oct 12, 2023 at 12:01 PM Peter Geoghegan <pg@bowt.ie> wrote: > > No objections from me. > > Here is a doc-only patch that I think could be back-patched as far as > emergency mode exists. It combines all of the wording changes to the > documentation from v1-v3 of the previous version, but without changing > the message text that is quoted in the documentation, and without > adding more instances of similar message texts to the documentation, > and with a bunch of additional hacking by me. It's a bit weird that we're effectively saying "pay no attention to that terrible HINT"...but I get it. The simple fact is that the docs were written in a way that allowed misinformation to catch on -- the damage that needs to be undone isn't exactly limited to the docs themselves. Your choice to not backpatch the changes to the log messages makes a lot more sense, now that I see that I see the wider context built by this preparatory patch. Arguably, it would be counterproductive to pretend that we didn't make this mistake on the backbranches. Better to own the mistake. > Some things I changed: > > - I made it so that the MXID section refers back to the XID section > instead of duplicating it, but with a short list of differences. > - I weakened the existing claim that says you must be a superuser or > VACUUM definitely won't fix it to say instead that you SHOULD run > VACUUM as the superuser, because the former is false and the latter is > true. > - I made the list of steps for recovering more explicit. > - I split out the bit about running autovacuum in the affected > database into a separate step to be performed after VACUUM for > continued good operation, rather than a necessary ingredient in > recovery, because it isn't. > - A bit of other minor rejiggering. Those all make sense to me. > I'm not forgetting about the rest of the proposed patch set, or the > change I proposed earlier. I'm just posting this much now because this > is how far I got today, and it would be useful to get comments before > I go further. I think the residual portion of the patch set not > included in this documentation patch will be quite small, and I think > that's a good thing, but again, I don't intend to blow that off. Of course. Your general approach seems wise. Thanks for working on this. I will be relieved once this is finally taken care of. -- Peter Geoghegan
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
From
Aleksander Alekseev
Date:
Hi, > Those all make sense to me. > > > [...] > > Of course. Your general approach seems wise. > > Thanks for working on this. I will be relieved once this is finally > taken care of. +1, and many thanks for your attention to the patchset and all the details! -- Best regards, Aleksander Alekseev
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
From
Robert Haas
Date:
On Fri, Oct 13, 2023 at 5:03 AM Aleksander Alekseev <aleksander@timescale.com> wrote: > > Thanks for working on this. I will be relieved once this is finally > > taken care of. > > +1, and many thanks for your attention to the patchset and all the details! Cool. I committed that and back-patched to v14, and here's the rest. 0001 makes the terminology change that I proposed earlier, and then 0002 is the remainder of what was in the previous patch set that wasn't covered by what I committed already, with a few adjustments. In particular, I preferred to stick with "avoid" rather than changing to "prevent," and I thought it was clearer to refer to "failures" plural rather than "failure" collective. These are arguable decisions, though. I propose to commit these changes only to master. I have included a fairly long paragraph about that in the commit message for 0002. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
From
Peter Geoghegan
Date:
On Mon, Oct 16, 2023 at 11:06 AM Robert Haas <robertmhaas@gmail.com> wrote: > I propose to commit these changes only to master. I have included a > fairly long paragraph about that in the commit message for 0002. LGTM, except for one small detail: I noticed that you misspelled "translations" in the commit message. Thanks for getting this over the line -- Peter Geoghegan
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
From
Robert Haas
Date:
On Mon, Oct 16, 2023 at 3:46 PM Peter Geoghegan <pg@bowt.ie> wrote: > On Mon, Oct 16, 2023 at 11:06 AM Robert Haas <robertmhaas@gmail.com> wrote: > > I propose to commit these changes only to master. I have included a > > fairly long paragraph about that in the commit message for 0002. > > LGTM, except for one small detail: I noticed that you misspelled > "translations" in the commit message. Oops. Fixed locally. > Thanks for getting this over the line Sure thing. I'm glad we're finally doing something about it. -- Robert Haas EDB: http://www.enterprisedb.com
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
From
Aleksander Alekseev
Date:
Hi, > > LGTM, except for one small detail: I noticed that you misspelled > > "translations" in the commit message. > > Oops. Fixed locally. v11-0001 and v11-0002 LGTM too. IMO "to assign a XID" sounds better than "to generate a XID". -- Best regards, Aleksander Alekseev
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
From
Robert Haas
Date:
On Tue, Oct 17, 2023 at 4:57 AM Aleksander Alekseev <aleksander@timescale.com> wrote: > v11-0001 and v11-0002 LGTM too. Cool. Seems we are all in agreement, so committed these. Thanks! -- Robert Haas EDB: http://www.enterprisedb.com
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
From
John Naylor
Date:
On Tue, Oct 17, 2023 at 9:39 PM Robert Haas <robertmhaas@gmail.com> wrote: > > Cool. Seems we are all in agreement, so committed these. Thanks! Thank you for getting this across the finish line!
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
From
Alexander Lakhin
Date:
Hello Robert, 17.10.2023 17:39, Robert Haas wrote: > On Tue, Oct 17, 2023 at 4:57 AM Aleksander Alekseev > <aleksander@timescale.com> wrote: >> v11-0001 and v11-0002 LGTM too. > Cool. Seems we are all in agreement, so committed these. Thanks! Please look at the following sentence added by the commit: ... to issue manual <command>VACUUM</command> commands on the tables where <structfield>relminxid</structfield> is oldest. Isn't relminxid a typo there? Best regards, Alexander