Thread: Add 64-bit XIDs into PostgreSQL 15
Transaction ID wraparound has long been a real pain for heavily loaded systems. One source of performance degradation is the need to vacuum before every wraparound.
There have been several proposals to make XIDs 64-bit, such as [1], [2], [3] and [4], to name a few.
The approach in [2] seems to have stalled in the CF since 2018, but meanwhile it has been used successfully in our Postgres Pro fork ever since. We have hundreds of customers using 64-bit XIDs. Dozens of instances are under loads that would require a wraparound every 1-5 days with 32-bit XIDs.
It really helps customers with a huge transaction load who, with 32-bit XIDs, could experience wraparound every day. So I'd like to propose this modified approach to the CF.
PFA an updated working patch, v6, for the PG15 development cycle.
It is based on version 5 of Alexander Korotkov's patch [5], with a few fixes and some refactoring, and has been rebased onto PG15.
Main changes:
- Change TransactionId to 64-bit.
- The on-disk tuple format (HeapTupleHeader) is unchanged: xmin and xmax
remain 32-bit.
-- The 32-bit xid type is now named ShortTransactionId.
-- Exception: see the "double xmax" format below.
- The heap page format is changed to contain xid and multixact base values
that tuples' xmin and xmax are offsets from (see the sketch after this list).
-- xid_base and multi_base are stored as page special data; PageHeader
remains unmodified.
-- If after upgrade a page has no free space for the special data, its tuples
are converted to the "double xmax" format: xmin becomes a virtual
FrozenTransactionId and xmax occupies the whole 64 bits.
The page is converted to the new format once vacuum frees enough space.
- In-memory tuples (HeapTuple) are enriched with copies of xid_base and
multi_base from the page.
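To make this a bit more concrete, here is a rough sketch of the special-area bases and of how a tuple's 64-bit xmin is reconstructed (simplified; the identifier names are illustrative, and the real code also has to pass through special XIDs such as Invalid/Frozen/Bootstrap and handle the "double xmax" case):

#include "postgres.h"
#include "access/htup_details.h"
#include "storage/bufpage.h"

typedef struct HeapPageSpecialData
{
    uint64      pd_xid_base;    /* base for tuple xmin/xmax on this page */
    uint64      pd_multi_base;  /* base for multixact xmax on this page */
} HeapPageSpecialData;          /* 16 bytes at the end of every heap page */

static inline uint64
HeapTupleGetXmin64(Page page, HeapTupleHeader tuple)
{
    HeapPageSpecialData *special =
        (HeapPageSpecialData *) PageGetSpecialPointer(page);

    /* the on-disk t_xmin stays 32-bit and is an offset from the page's base */
    return special->pd_xid_base + (uint64) HeapTupleHeaderGetRawXmin(tuple);
}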
ToDo:
- Replace xid_base/multi_base in HeapTuple with precalculated 64-bit
xmin/xmax.
- Attempt to split the patch into an "in-memory" part (add xmin/xmax to
HeapTuple) and a "storage" format change.
- Try to implement the storage part as a table access method.
Your opinions are very much welcome!
[1] https://www.postgresql.org/message-id/flat/1611355191319-0.post%40n3.nabble.com#c884ac33243ded0a47881137c6c96f6b
[2] https://www.postgresql.org/message-id/flat/DA1E65A4-7C5A-461D-B211-2AD5F9A6F2FD%40gmail.com
[3] https://www.postgresql.org/message-id/flat/CAPpHfduQ7KuCHvg3dHx%2B9Pwp_rNf705bjdRCrR_Cqiv_co4H9A%40mail.gmail.com
[4] https://www.postgresql.org/message-id/flat/51957591572599112%40myt5-3a82a06244de.qloud-c.yandex.net
[5] https://www.postgresql.org/message-id/CAPpHfdseWf0QLWMAhLgiyP4u%2B5WUondzdQ_Yd-eeF%3DDuj%3DVq0g%40mail.gmail.com
Attachment
Hi Maxim,
I’m glad to see that you’re trying to carry the 64-bit XID work forward. I had not noticed that my earlier patch (also derived from Alexander Korotkov’s patch) was responded to back in September. Perhaps we can merge some of the code cleanup that it contained, such as using XID_FMT everywhere and creating a type for the kind of page returned by TransactionIdToPage() to make the code cleaner.
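(To illustrate, the XID_FMT part of that cleanup is mostly mechanical -- something along these lines, with the macro presumably expanding to the 64-bit format string once TransactionId is widened; the variable here is just an example:)

/* before: 32-bit XIDs are printed with %u all over the tree */
elog(LOG, "oldest xid is %u", oldestXid);

/* after: one format macro, e.g. */
#define XID_FMT UINT64_FORMAT
elog(LOG, "oldest xid is " XID_FMT, oldestXid);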
Is your patch functionally the same as the PostgresPro implementation? If so, I think it would be helpful for everyone’s understanding to read the PostgresPro documentation on VACUUM. See in particular section “Forced shrinking pg_clog and pg_multixact”
https://postgrespro.com/docs/enterprise/9.6/routine-vacuuming#vacuum-for-wraparound
best regards,
/Jim
Greetings,

* Maxim Orlov (orlovmg@gmail.com) wrote: > Long time wraparound was a really big pain for highly loaded systems. One > source of performance degradation is the need to vacuum before every > wraparound. > And there were several proposals to make XIDs 64-bit like [1], [2], [3] and > [4] to name a few. > > The approach [2] seems to have stalled on CF since 2018. But meanwhile it > was successfully being used in our Postgres Pro fork all time since then. > We have hundreds of customers using 64-bit XIDs. Dozens of instances are > under load that require wraparound each 1-5 days with 32-bit XIDs. > It really helps the customers with a huge transaction load that in the case > of 32-bit XIDs could experience wraparounds every day. So I'd like to > propose this approach modification to CF. > > PFA updated working patch v6 for PG15 development cycle. > It is based on a patch by Alexander Korotkov version 5 [5] with a few > fixes, refactoring and was rebased to PG15.

Just to confirm, as I only did a quick look: if a transaction in such a high-rate system lasts for more than a day (which certainly isn't completely out of the question, I've had week-long transactions before..), and something tries to delete a tuple which has tuples on it that can't be frozen yet due to the long-running transaction - it's just going to fail?

Not saying that I've got any idea how to fix that case offhand, and we don't really support such a thing today as the server would just stop instead, but if I saw something in the release notes talking about PG moving to 64-bit transaction IDs, I'd probably be pretty surprised to discover that there's still a 32-bit limit that you have to watch out for or your system will just start failing transactions. Perhaps that's a worthwhile tradeoff for being able to generally avoid having to vacuum and deal with transaction wrap-around, but I have to wonder if there might be a better answer.

Of course, I also wonder about how we're going to document and monitor for this potential issue and what kind of corrective action will be needed (kill transactions older than a certain number of transactions..?).

Thanks, Stephen
Attachment
On 1/4/22, 2:35 PM, "Stephen Frost" <sfrost@snowman.net> wrote:
>>
>> Not saying that I've got any idea how to fix that case offhand, and we
>> don't really support such a thing today as the server would just stop
>> instead, ...
>> Perhaps that's a
>> worthwhile tradeoff for being able to generally avoid having to vacuum
>> and deal with transaction wrap-around, but I have to wonder if there
>> might be a better answer.
>>

For the target use cases that PostgreSQL is designed for, it's a very worthwhile tradeoff in my opinion. Such long-running transactions need to be killed.

Re: -- If after upgrade page has no free space for special data, tuples are converted to "double xmax" format: xmin became virtual FrozenTransactionId, xmax occupies the whole 64bit. Page converted to new format when vacuum frees enough space.

I'm concerned about the maintainability impact of having 2 new on-disk page formats. It's already complex enough with XIDs and multixact-XIDs.

If the lack of space for the two epochs in the special data area is a problem only in an upgrade scenario, why not resolve the problem before completing the upgrade process, like a kind of post-process pg_repack operation that converts all "double xmax" pages to the "double-epoch" page format? i.e. maybe the "double xmax" representation is needed as an intermediate representation during upgrade, but after upgrade completes successfully there are no pages with the "double-xmax" representation. This would eliminate a whole class of coding errors and would make the code dealing with 64-bit XIDs simpler and more maintainable.

/Jim
On 2021/12/30 21:15, Maxim Orlov wrote: > Hi, hackers! > > Long time wraparound was a really big pain for highly loaded systems. One source of performance degradation is the need to vacuum before every wraparound. > And there were several proposals to make XIDs 64-bit like [1], [2], [3] and [4] to name a few. > > The approach [2] seems to have stalled on CF since 2018. But meanwhile it was successfully being used in our Postgres Pro fork all time since then. We have hundreds of customers using 64-bit XIDs. Dozens of instances are under load that require wraparound each 1-5 days with 32-bit XIDs. > It really helps the customers with a huge transaction load that in the case of 32-bit XIDs could experience wraparounds every day. So I'd like to propose this approach modification to CF. > > PFA updated working patch v6 for PG15 development cycle. > It is based on a patch by Alexander Korotkov version 5 [5] with a few fixes, refactoring and was rebased to PG15.

Thanks a lot! I'm really happy to see this proposal again!!

Is there any documentation or README explaining this whole 64-bit XID mechanism?

Could you tell me what happens if new tuple with XID larger than xid_base + 0xFFFFFFFF is inserted into the page? Such new tuple is not allowed to be inserted into that page? Or xid_base and xids of all existing tuples in the page are increased? Also what happens if one of those xids (of existing tuples) cannot be changed because the tuple still can be seen by very-long-running transaction?

Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On Thu, 30 Dec 2021 at 13:19, Maxim Orlov <orlovmg@gmail.com> wrote: > > Hi, hackers! > > Long time wraparound was a really big pain for highly loaded systems. One source of performance degradation is the need to vacuum before every wraparound. > And there were several proposals to make XIDs 64-bit like [1], [2], [3] and [4] to name a few.

Very good to see this revived.

> PFA updated working patch v6 for PG15 development cycle. > It is based on a patch by Alexander Korotkov version 5 [5] with a few fixes, refactoring and was rebased to PG15. > > Main changes: > - Change TransactionId to 64bit

This sounds like a good route to me.

> - Disk tuple format (HeapTupleHeader) is unchanged: xmin and xmax > remains 32bit > -- 32bit xid is named ShortTransactionId now. > -- Exception: see "double xmax" format below. > - Heap page format is changed to contain xid and multixact base value, > tuple's xmin and xmax are offsets from. > -- xid_base and multi_base are stored as a page special data. PageHeader > remains unmodified. > -- If after upgrade page has no free space for special data, tuples are > converted to "double xmax" format: xmin became virtual > FrozenTransactionId, xmax occupies the whole 64bit. > Page converted to new format when vacuum frees enough space. > - In-memory tuples (HeapTuple) were enriched with copies of xid_base and > multi_base from a page.

I think we need more Overview of Behavior than is available with this patch, perhaps in the form of a README, such as in src/backend/access/heap/README.HOT. Most people's comments are about the opportunities and problems this causes, and mine are definitely there also, i.e. explain the user-visible behavior. Please explain the various new states that pages can be in and what the effects are.

My understanding is this would be backwards compatible, so we can upgrade to it. Please confirm.

Thanks

-- Simon Riggs http://www.EnterpriseDB.com/
On Tue, Jan 4, 2022 at 10:22:50PM +0000, Finnerty, Jim wrote: > I'm concerned about the maintainability impact of having 2 new > on-disk page formats. It's already complex enough with XIDs and > multixact-XIDs. > > If the lack of space for the two epochs in the special data area is > a problem only in an upgrade scenario, why not resolve the problem > before completing the upgrade process like a kind of post-process > pg_repack operation that converts all "double xmax" pages to > the "double-epoch" page format? i.e. maybe the "double xmax" > representation is needed as an intermediate representation during > upgrade, but after upgrade completes successfully there are no pages > with the "double-xmax" representation. This would eliminate a whole > class of coding errors and would make the code dealing with 64-bit > XIDs simpler and more maintainable. Well, yes, we could do this, and it would avoid the complexity of having to support two XID representations, but we would need to accept that fast pg_upgrade would be impossible in such cases, since every page would need to be checked and potentially updated. You might try to do this while the server is first started and running queries, but I think we found out from the online checkpoint patch that having the server in an intermediate state while running queries is very complex --- it might be simpler to just accept two XID formats all the time than enabling the server to run with two formats for a short period. My big point is that this needs more thought. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Thu, Dec 30, 2021 at 03:15:16PM +0300, Maxim Orlov wrote: > PFA updated working patch v6 for PG15 development cycle. > It is based on a patch by Alexander Korotkov version 5 [5] with a few fixes, > refactoring and was rebased to PG15. > > Main changes: > - Change TransactionId to 64bit > - Disk tuple format (HeapTupleHeader) is unchanged: xmin and xmax > remains 32bit > -- 32bit xid is named ShortTransactionId now. > -- Exception: see "double xmax" format below. > - Heap page format is changed to contain xid and multixact base value, > tuple's xmin and xmax are offsets from. > -- xid_base and multi_base are stored as a page special data. PageHeader > remains unmodified. > -- If after upgrade page has no free space for special data, tuples are > converted to "double xmax" format: xmin became virtual > FrozenTransactionId, xmax occupies the whole 64bit. > Page converted to new format when vacuum frees enough space.

I think it is a great idea to allow the 64-bit XID to span the 32-bit xmin and xmax fields when needed. It would be nice if we can get focus on this feature so we are sure it gets into PG 15. Can we add this patch incrementally so people can more easily analyze it?

-- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Tue, Jan 4, 2022 at 05:49:07PM +0000, Finnerty, Jim wrote: > Hi Maxim, > I’m glad to see that you’re trying to carry the 64-bit XID work forward. I > had not noticed that my earlier patch (also derived from Alexander Korotkov’s > patch) was responded to back in September. Perhaps we can merge some of the > code cleanup that it contained, such as using XID_FMT everywhere and creating a > type for the kind of page returned by TransactionIdToPage() to make the code > cleaner. > > Is your patch functionally the same as the PostgresPro implementation? If > so, I think it would be helpful for everyone’s understanding to read the > PostgresPro documentation on VACUUM. See in particular section “Forced > shrinking pg_clog and pg_multixact” > > https://postgrespro.com/docs/enterprise/9.6/routine-vacuuming#vacuum-for-wraparound

Good point --- we still need vacuum freeze. It would be good to understand how much value we get in allowing vacuum freeze to be done less often --- how big can pg_xact/pg_multixact get before they are problems?

-- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
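(Back-of-envelope, for rough scale only: pg_xact stores 2 status bits per transaction, so even the entire 32-bit XID space amounts to 2^32 * 2 bits = 1 GiB of clog, and freezing normally keeps it far smaller. With 64-bit XIDs and no freezing, pg_xact and the other SLRUs could in principle keep growing indefinitely, which is why some truncation/freeze mechanism is still needed.)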
Hi! On Thu, Jan 6, 2022 at 3:02 AM Bruce Momjian <bruce@momjian.us> wrote: > > On Thu, Dec 30, 2021 at 03:15:16PM +0300, Maxim Orlov wrote: > > PFA updated working patch v6 for PG15 development cycle. > > It is based on a patch by Alexander Korotkov version 5 [5] with a few fixes, > > refactoring and was rebased to PG15. > > > > Main changes: > > - Change TransactionId to 64bit > > - Disk tuple format (HeapTupleHeader) is unchanged: xmin and xmax > > remains 32bit > > -- 32bit xid is named ShortTransactionId now. > > -- Exception: see "double xmax" format below. > > - Heap page format is changed to contain xid and multixact base value, > > tuple's xmin and xmax are offsets from. > > -- xid_base and multi_base are stored as a page special data. PageHeader > > remains unmodified. > > -- If after upgrade page has no free space for special data, tuples are > > converted to "double xmax" format: xmin became virtual > > FrozenTransactionId, xmax occupies the whole 64bit. > > Page converted to new format when vacuum frees enough space. > > I think it is a great idea to allow the 64-XID to span the 32-bit xmin > and xmax fields when needed. It would be nice if we can get focus on > this feature so we are sure it gets into PG 15. Can we add this patch > incrementally so people can more easily analyze it?

I see at least the following major issues/questions in this patch.

1) Current code relies on the fact that TransactionId can be atomically read from/written to shared memory. With 32-bit systems and 64-bit TransactionId, that's not true anymore. Therefore, the patch has concurrency issues on 32-bit systems. We need to carefully review these issues and provide a fallback for 32-bit systems. I suppose nobody is thinking about dropping support for 32-bit systems, right? Also, I wonder how critical for us is the overhead for 32-bit systems. They are going to become legacy, so overhead isn't so critical, right?

2) With this patch we still need to freeze to cut SLRUs. This is especially problematic with Multixacts, because systems heavily using row-level locks can consume an enormous amount of multixacts. That is especially problematic when we have 2x bigger multixacts. We probably can provide an alternative implementation for multixact vacuum, which doesn't require scanning all the heaps. That is a fair amount of work, though. The clog is much smaller and we can cut it more rarely. Perhaps, we could tolerate freezing to cut clog, couldn't we?

3) 2x bigger in-memory representation of TransactionId has overheads. In particular, it could erode the gains from recent advancements by Andres Freund. I'm not exactly sure we should/can do something with this. But I think this should be at least carefully analyzed.

4) SP-GiST index stores TransactionId on pages. Could we tolerate dropping SP-GiST indexes on a major upgrade? Or do we have to invent something smarter?

5) 32-bit limitation within the page mentioned upthread by Stephen Frost should be also carefully considered. Ideally, we should get rid of it, but I don't have particular ideas in this field for now. At least, we should make sure we did our best at error reporting and monitoring capabilities.

I think the realistic goal for PG 15 development cycle would be agreement on a roadmap for all the items above (and probably some initial implementations).

------ Regards, Alexander Korotkov
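A minimal sketch for item 1, assuming shared XID fields get wrapped in the existing port/atomics.h API (names are illustrative; on platforms without native 64-bit atomics these calls fall back to an emulated, locked implementation, whose overhead is what would need review):

#include "postgres.h"
#include "port/atomics.h"

typedef struct XidShared
{
    pg_atomic_uint64    nextXid;    /* instead of a bare TransactionId */
} XidShared;

static uint64
ReadNextXid(XidShared *shared)
{
    /* single atomic 8-byte read where supported, locked emulation otherwise */
    return pg_atomic_read_u64(&shared->nextXid);
}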
On Thu, 30 Dec 2021 at 13:19, Maxim Orlov <orlovmg@gmail.com> wrote: > Your opinions are very much welcome! This is a review of the Int64 options patch, "v6-0001-Add-64-bit-GUCs-for-xids.patch" Applies cleanly, with some fuzz, compiles cleanly and passes make check. Patch eyeballs OK, no obvious defects. Tested using the attached test, so seems to work correctly. On review of docs, no additions or changes required. Perhaps add something to README? If so, minor doc patch attached. Otherwise, this sub-patch is READY FOR COMMITTER. -- Simon Riggs http://www.EnterpriseDB.com/
Attachment
(Maxim) Re: -- If after upgrade page has no free space for special data, tuples are converted to "double xmax" format: xmin became virtual FrozenTransactionId, xmax occupies the whole 64bit. Page converted to new format when vacuum frees enough space.

A better way would be to prepare the database for conversion to the 64-bit XID format before the upgrade so that it ensures that every page has enough room for the two new epochs (E bits).

1. Enforce the rule that no INSERT or UPDATE to an existing page will leave less than E bits of free space on a heap page

2. Run an online and restartable task, analogous to pg_repack, that rewrites and splits any page that has less than E bits of free space. This needs to be run on all non-temp tables in all schemas in all databases. DDL operations are not allowed on a target table while this operation runs, which is enforced by taking an ACCESS SHARE lock on each table while the process is running. To mitigate the effects of this restriction, the restartable task can be restricted to run only in certain hours. This could be implemented as a background maintenance task that runs for X hours as of a certain time of day and then kicks itself off again in 24-X hours, logging its progress.

When this task completes, the database is ready for upgrade to 64-bit XIDs, and there is no possibility that any page has insufficient free space for the special data.

Would you agree that this approach would completely eliminate the need for a "double xmax" representation?

/Jim
On Thu, 6 Jan 2022 at 13:15, Finnerty, Jim <jfinnert@amazon.com> wrote: > > (Maxim) Re: -- If after upgrade page has no free space for special data, tuples are > converted to "double xmax" format: xmin became virtual > FrozenTransactionId, xmax occupies the whole 64bit. > Page converted to new format when vacuum frees enough space. > > A better way would be to prepare the database for conversion to the 64-bit XID format before the upgrade so that it ensures that every page has enough room for the two new epochs (E bits).

Most software has a one-stage upgrade model. What you propose would have us install 2 things, with a step in-between, which makes it harder to manage.

> 1. Enforce the rule that no INSERT or UPDATE to an existing page will leave less than E bits of free space on a heap page > > 2. Run an online and restartable task, analogous to pg_repack, that rewrites and splits any page that has less than E bits of free space. This needs to be run on all non-temp tables in all schemas in all databases. DDL operations are not allowed on a target table while this operation runs, which is enforced by taking an ACCESS SHARE lock on each table while the process is running. To mitigate the effects of this restriction, the restartable task can be restricted to run only in certain hours. This could be implemented as a background maintenance task that runs for X hours as of a certain time of day and then kicks itself off again in 24-X hours, logging its progress. > > When this task completes, the database is ready for upgrade to 64-bit XIDs, and there is no possibility that any page has insufficient free space for the special data. > > Would you agree that this approach would completely eliminate the need for a "double xmax" representation?

I agree about the idea for scanning existing data blocks, but why not do this AFTER upgrade?

1. Upgrade, with important aspect not-enabled-yet, but everything else working - all required software is delivered in one shot, with fast upgrade
2. As each table is VACUUMed, we confirm/clean/groom data blocks so each table is individually confirmed as being ready. The pace that this happens at is under user control.
3. When all tables have been prepared, then restart to allow xid64 format usage

-- Simon Riggs http://www.EnterpriseDB.com/
Re: Most software has a one-stage upgrade model. What you propose would have us install 2 things, with a step in-between, which makes it harder to manage.

The intended benefit would be that the code doesn't need to handle the possibility of 2 different XID representations for the indefinite future.

I agree that VACUUM would be the preferred tool to make room for the special data area so that there is no need to install a separate tool, though, whether this work happens before or after the upgrade.

Re: 1. Upgrade, with important aspect not-enabled-yet, but everything else working - all required software is delivered in one shot, with fast upgrade

Let's clarify what happens during upgrade. What format are the pages in immediately after the upgrade?

2. As each table is VACUUMed, we confirm/clean/groom data blocks so each table is individually confirmed as being ready. The pace that this happens at is under user control.

What are VACUUM's new responsibilities in this phase? VACUUM needs a new task that confirms when there exists no heap page for a table that is not ready. If upgrade put all the pages into either double-xmax or double-epoch representation, then VACUUM's responsibility could be to split the double-xmax pages into the double-epoch representation and verify when there exist no double-xmax pages.

3. When all tables have been prepared, then restart to allow xid64 format usage

Let's also clarify what happens at restart time.

If we were to do the upgrade before preparing space in advance, is there a way to ever remove the code that knows about the double-xmax XID representation?
On Tue, Jan 4, 2022 at 9:40 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > Could you tell me what happens if new tuple with XID larger than xid_base + 0xFFFFFFFF is inserted into the page? Such new tuple is not allowed to be inserted into that page?

I fear that this patch will have many bugs along these lines. Example: Why is it okay that convert_page() may have to defragment a heap page, without holding a cleanup lock? That will subtly break code that holds a pin on the buffer, when a tuple slot contains a C pointer to a HeapTuple in shared memory (though only if we get unlucky).

Currently we are very permissive about what XID/backend can (or cannot) consume a specific piece of free space from a specific heap page in code like RelationGetBufferForTuple(). It would be very hard to enforce a policy like "your XID cannot insert onto this particular heap page" with the current FSM design. (I actually think that we should have a FSM that supports these requirements, but that's a big project -- a long term goal of mine [1].)

> Or xid_base and xids of all existing tuples in the page are increased? Also what happens if one of those xids (of existing tuples) cannot be changed because the tuple still can be seen by very-long-running transaction?

That doesn't work in the general case, I think. How could it, unless we truly had 64-bit XIDs in heap tuple headers? You can't necessarily freeze to fix the problem, because we can only freeze XIDs that 1.) committed, and 2.) are visible to every possible MVCC snapshot. (I think you were alluding to this problem yourself.)

I believe that a good solution to the problem that this patch tries to solve needs to be more ambitious. I think that we need to return to first principles, rather than extending what we have already.

Currently, we store XIDs in tuple headers so that we can determine the tuple's visibility status, based on whether the XID committed (or aborted), and where our snapshot sees the XID as "in the past" (in the case of an inserted tuple's xmin). You could say that XIDs from tuple headers exist so we can see differences *between* tuples. But these differences are typically not useful/interesting for very long. 32-bit XIDs are sometimes not wide enough, but usually they're "too wide": Why should we need to consider an old XID (e.g. do clog lookups) at all, barring extreme cases? Why do we need to keep any kind of metadata about transactions around for a long time? Postgres has not supported time travel in 25 years!

If we eagerly cleaned-up aborted transactions with a special kind of VACUUM (which would remove aborted XIDs), we could also maintain a structure that indicates if all of the XIDs on a heap page are known "all committed" implicitly (no dirtying the page, no hint bits, etc) -- something a little like the visibility map, that is mostly set implicitly (not during VACUUM). That doesn't fix the wraparound problem itself, of course. But it enables a design that imposes the same problem on the specific old snapshot instead -- something like a "snapshot too old" error is much better than a system-wide wraparound failure.

That approach is definitely very hard, and also requires a smart FSM along the lines described in [1], but it seems like the best way forward. As I pointed out already, freezing is bad because it imposes the requirement that everybody considers an affected XID committed and visible, which is brittle (e.g., old snapshots can cause wraparound failure).
More generally, we rely too much on explicitly maintaining "absolute" metadata inline, when we should implicitly maintain "relative" metadata (that can be discarded quickly and without concern for old snapshots). We need to be more disciplined about what XIDs can modify what heap pages in the first place (in code like hio.c and the FSM) to make all this work. [1] https://www.postgresql.org/message-id/CAH2-Wz%3DzEV4y_wxh-A_EvKxeAoCMdquYMHABEh_kZO1rk3a-gw%40mail.gmail.com -- Peter Geoghegan
On Thu, Jan 6, 2022 at 4:15 PM Finnerty, Jim <jfinnert@amazon.com> wrote: > (Maxim) Re: -- If after upgrade page has no free space for special data, tuples are > converted to "double xmax" format: xmin became virtual > FrozenTransactionId, xmax occupies the whole 64bit. > Page converted to new format when vacuum frees enough space. > > A better way would be to prepare the database for conversion to the 64-bit XID format before the upgrade so that it ensures that every page has enough room for the two new epochs (E bits). > > 1. Enforce the rule that no INSERT or UPDATE to an existing page will leave less than E bits of free space on a heap page > > 2. Run an online and restartable task, analogous to pg_repack, that rewrites and splits any page that has less than E bits of free space. This needs to be run on all non-temp tables in all schemas in all databases. DDL operations are not allowed on a target table while this operation runs, which is enforced by taking an ACCESS SHARE lock on each table while the process is running. To mitigate the effects of this restriction, the restartable task can be restricted to run only in certain hours. This could be implemented as a background maintenance task that runs for X hours as of a certain time of day and then kicks itself off again in 24-X hours, logging its progress. > > When this task completes, the database is ready for upgrade to 64-bit XIDs, and there is no possibility that any page has insufficient free space for the special data. > > Would you agree that this approach would completely eliminate the need for a "double xmax" representation?

The "prepare" approach was the first tried. https://github.com/postgrespro/pg_pageprep But it appears to be very difficult and unreliable. After investing many months into pg_pageprep, "double xmax" approach appears to be very fast to implement and reliable.

------ Regards, Alexander Korotkov
On 2022/01/06 19:24, Simon Riggs wrote: > On Thu, 30 Dec 2021 at 13:19, Maxim Orlov <orlovmg@gmail.com> wrote: > >> Your opinions are very much welcome! > > This is a review of the Int64 options patch, > "v6-0001-Add-64-bit-GUCs-for-xids.patch"

Do we really need to support both int32 and int64 options? Isn't it enough to replace the existing int32 option with int64 one? Or how about using string-type option for very large number like 64-bit XID, like it's done for recovery_target_xid?

Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: clog page numbers, as returned by TransactionIdToPage

- int pageno = TransactionIdToPage(xid); /* get page of parent */
+ int64 pageno = TransactionIdToPage(xid); /* get page of parent */
...
- int pageno = TransactionIdToPage(subxids[0]);
+ int64 pageno = TransactionIdToPage(subxids[0]);
  int offset = 0;
  int i = 0;
...
- int nextpageno;
+ int64 nextpageno;

Etc.

In all those places where you are replacing int with int64 for the kind of values returned by TransactionIdToPage(), would you mind replacing the int64's with a type name, such as ClogPageNumber, for improved code maintainability?
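Sketched, in other words (the type name is only a suggestion):

typedef int64 ClogPageNumber;

ClogPageNumber pageno = TransactionIdToPage(xid); /* get page of parent */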
Re: The "prepare" approach was the first tried. https://github.com/postgrespro/pg_pageprep But it appears to be very difficult and unreliable. After investing many months into pg_pageprep, "double xmax" approach appears to be very fast to implement and reliable.

I'd still like a plan to retire the "double xmax" representation eventually. Previously I suggested that this could be done as a post-process, before upgrade is complete, but that could potentially make upgrade very slow.

Another way to retire the "double xmax" representation eventually could be to disallow "double xmax" pages in subsequent major version upgrades (e.g. to PG16, if "double xmax" pages are introduced in PG15). This gives the luxury of time after a fast upgrade to convert all pages to contain the epochs, while still providing a path to more maintainable code in the future.
On Fri, Jan 07, 2022 at 03:53:51PM +0000, Finnerty, Jim wrote: > I'd still like a plan to retire the "double xmax" representation eventually. Previously I suggested that this could be done as a post-process, before upgrade is complete, but that could potentially make upgrade very slow. > > Another way to retire the "double xmax" representation eventually could be to disallow "double xmax" pages in subsequent major version upgrades (e.g. to PG16, if "double xmax" pages are introduced in PG15). This gives the luxury of time after a fast upgrade to convert all pages to contain the epochs, while still providing a path to more maintainable code in the future.

Yes, but how are you planning to rewrite it? Is vacuum enough? I suppose it'd need FREEZE + DISABLE_PAGE_SKIPPING ?

This would preclude upgrading "across" v15. Maybe that'd be okay, but it'd be a new and atypical restriction. How would you enforce that it'd been run on v15 before upgrading to pg16 ? You'd need to track whether vacuum had completed the necessary steps in pg15. I don't think it'd be okay to make pg_upgrade --check read every tuple.

The "keeping track" part is what reminds me of the online checksum patch. It'd be ideal if there were a generic solution to this kind of task, or at least a "model" process to follow.

-- Justin
On 07.01.22 06:18, Fujii Masao wrote: > On 2022/01/06 19:24, Simon Riggs wrote: >> On Thu, 30 Dec 2021 at 13:19, Maxim Orlov <orlovmg@gmail.com> wrote: >> >>> Your opinions are very much welcome! >> >> This is a review of the Int64 options patch, >> "v6-0001-Add-64-bit-GUCs-for-xids.patch" > > Do we really need to support both int32 and int64 options? Isn't it > enough to replace the existing int32 option with int64 one? I think that would create a lot of problems. You'd have to change every underlying int variable to int64, and then check whether that causes any issues where they are used (wrong printf format, assignments, overflows), and you'd have to check whether the existing limits are still appropriate. And extensions would be upset. This would be a big mess. > Or how about > using string-type option for very large number like 64-bit XID, like > it's done for recovery_target_xid? Seeing how many variables that contain transaction ID information actually exist, I think it could be worth introducing a new category as proposed. Otherwise, you'd have to write a lot of check and assign hooks. I do wonder whether signed vs. unsigned is handled correctly. Transaction IDs are unsigned, but all GUC handling is signed.
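To illustrate the amount of hook boilerplate the string-GUC route implies, a rough sketch of one such check hook, following the pattern used for recovery_target_xid (the function and GUC names are invented):

#include "postgres.h"
#include "utils/guc.h"

static bool
check_some_xid_guc(char **newval, void **extra, GucSource source)
{
    uint64      xid;
    char       *endptr;

    errno = 0;
    xid = strtoull(*newval, &endptr, 10);
    if (endptr == *newval || *endptr != '\0' || errno == ERANGE)
    {
        GUC_check_errdetail("\"%s\" is not a valid 64-bit transaction ID.", *newval);
        return false;
    }

    /* stash the parsed value for the corresponding assign hook */
    *extra = malloc(sizeof(uint64));
    if (*extra == NULL)
        return false;
    *((uint64 *) *extra) = xid;
    return true;
}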
On Fri, Jan 7, 2022 at 03:53:51PM +0000, Finnerty, Jim wrote: > Re: The "prepare" approach was the first tried. > https://github.com/postgrespro/pg_pageprep But it appears to be > very difficult and unreliable. After investing many months into > pg_pageprep, "double xmax" approach appears to be very fast to > implement and reliable. > > I'd still like a plan to retire the "double xmax" representation > eventually. Previously I suggested that this could be done as a > post-process, before upgrade is complete, but that could potentially > make upgrade very slow. > > Another way to retire the "double xmax" representation eventually > could be to disallow "double xmax" pages in subsequent major version > upgrades (e.g. to PG16, if "double xmax" pages are introduced in > PG15). This gives the luxury of time after a fast upgrade to convert > all pages to contain the epochs, while still providing a path to more > maintainable code in the future. This gets into the entire issue we have discussed in the past but never resolved --- how do we manage state changes in the Postgres file format while the server is running? pg_upgrade and pg_checksums avoid the problem by doing such changes while the server is down, and other file formats have avoided it by allowing perpetual reading of the old format. Any such non-perpetual changes while the server is running must deal with recording the start of the state change, the completion of it, communicating such state changes to all running backends in a synchronous manner, and possible restarting of the state change. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Fri, 7 Jan 2022 at 16:09, Justin Pryzby <pryzby@telsasoft.com> wrote: > > On Fri, Jan 07, 2022 at 03:53:51PM +0000, Finnerty, Jim wrote: > > I'd still like a plan to retire the "double xmax" representation eventually. Previously I suggested that this could be done as a post-process, before upgrade is complete, but that could potentially make upgrade very slow. > > > > Another way to retire the "double xmax" representation eventually could be to disallow "double xmax" pages in subsequent major version upgrades (e.g. to PG16, if "double xmax" pages are introduced in PG15). This gives the luxury of time after a fast upgrade to convert all pages to contain the epochs, while still providing a path to more maintainable code in the future. > > Yes, but how are you planning to rewrite it? Is vacuum enough?

Probably not, but VACUUM is the place to add such code.

> I suppose it'd need FREEZE + DISABLE_PAGE_SKIPPING ?

Yes

> This would preclude upgrading "across" v15. Maybe that'd be okay, but it'd be > a new and atypical restriction.

I don't see that restriction. Anyone upgrading from before PG15 would apply the transform. Just because we introduce a transform in PG15 doesn't mean we can't apply that transform in later releases as well, to allow say PG14 -> PG16.

-- Simon Riggs http://www.EnterpriseDB.com/
On Thu, Jan 6, 2022 at 3:45 PM Peter Geoghegan <pg@bowt.ie> wrote: > On Tue, Jan 4, 2022 at 9:40 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > > Could you tell me what happens if new tuple with XID larger than xid_base + 0xFFFFFFFF is inserted into the page? Suchnew tuple is not allowed to be inserted into that page? > > I fear that this patch will have many bugs along these lines. Example: > Why is it okay that convert_page() may have to defragment a heap page, > without holding a cleanup lock? That will subtly break code that holds > a pin on the buffer, when a tuple slot contains a C pointer to a > HeapTuple in shared memory (though only if we get unlucky). Yeah. I think it's possible that some approach along the lines of what is proposed here can work, but quality of implementation is a big issue. This stuff is not easy to get right. Another thing that I'm wondering about is the "double xmax" representation. That requires some way of distinguishing when that representation is in use. I'd be curious to know where we found the bits for that -- the tuple header isn't exactly replete with extra bit space. Also, if we have an epoch of some sort that is included in new page headers but not old ones, that adds branches to code that might sometimes be quite hot. I don't know how much of a problem that is, but it seems worth worrying about. For all of that, I don't particularly agree with Jim Finnerty's idea that we ought to solve the problem by forcing sufficient space to exist in the page pre-upgrade. There are some advantages to such approaches, but they make it really hard to roll out changes. You have to roll out the enabling change first, wait until everyone is running a release that supports it, and only then release the technology that requires the additional page space. Since we don't put new features into back-branches -- and an on-disk format change would be a poor place to start -- that would make rolling something like this out take many years. I think we'll be much happier putting all the complexity in the new release. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Jan 5, 2022 at 9:53 PM Alexander Korotkov <aekorotkov@gmail.com> wrote: > I see at least the following major issues/questions in this patch. > 1) Current code relies on the fact that TransactionId can be > atomically read from/written to shared memory. With 32-bit systems > and 64-bit TransactionId, that's not true anymore. Therefore, the > patch has concurrency issues on 32-bit systems. We need to carefully > review these issues and provide a fallback for 32-bit systems. I > suppose nobody is thinking about dropping off 32-bit systems, right? I think that's right. Not yet, anyway. > Also, I wonder how critical for us is the overhead for 32-bit systems. > They are going to become legacy, so overhead isn't so critical, right? Agreed. > 2) With this patch we still need to freeze to cut SLRUs. This is > especially problematic with Multixacts, because systems heavily using > row-level locks can consume an enormous amount of multixacts. That is > especially problematic when we have 2x bigger multixacts. We probably > can provide an alternative implementation for multixact vacuum, which > doesn't require scanning all the heaps. That is a pretty amount of > work though. The clog is much smaller and we can cut it more rarely. > Perhaps, we could tolerate freezing to cut clog, couldn't we? Right. We can't let any of the SLRUs -- don't forget about stuff like pg_subtrans, which is a multiple of the size of clog -- grow without bound, even if it never forces a system shutdown. I'm not sure it's a good idea to think about introducing new freezing mechanisms at the same time as we're making other changes, though. Just removing the possibility of a wraparound shutdown without changing any of the rules about how and when we freeze would be a significant advancement. Other changes could be left for future work. > 3) 2x bigger in-memory representation of TransactionId have overheads. > In particular, it could mitigate the effect of recent advancements > from Andres Freund. I'm not exactly sure we should/can do something > with this. But I think this should be at least carefully analyzed. Seems fair. > 4) SP-GiST index stores TransactionId on pages. Could we tolerate > dropping SP-GiST indexes on a major upgrade? Or do we have to invent > something smarter? Probably depends on how much work it is. SP-GiST indexes are not mainstream, so I think we could at least consider breaking compatibility, but it doesn't seem like a thing to do lightly. > 5) 32-bit limitation within the page mentioned upthread by Stephen > Frost should be also carefully considered. Ideally, we should get rid > of it, but I don't have particular ideas in this field for now. At > least, we should make sure we did our best at error reporting and > monitoring capabilities. I don't think I understand the thinking here. As long as we retain the existing limit that the oldest running XID can't be more than 2 billion XIDs in the past, we can't ever need to throw an error. A new page modification that finds very old XIDs on the page can always escape trouble by pruning the page and freezing whatever old XIDs survive pruning. I would argue that it's smarter not to change the in-memory representation of XIDs to 64-bit in places like the ProcArray. As you mention in (4), that might hurt performance. But also, the benefit is minimal. Nobody is really sad that they can't keep transactions open forever. They are sad because the system has severe bloat and/or shuts down entirely. 
Some kind of change along these lines can fix the second of those problems, and that's progress. > I think the realistic goal for PG 15 development cycle would be > agreement on a roadmap for all the items above (and probably some > initial implementations). +1. Trying to rush something through to commit is just going to result in a bunch of bugs. We need to work through the issues carefully and take the time to do it well. -- Robert Haas EDB: http://www.enterprisedb.com
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Wed, Jan 5, 2022 at 9:53 PM Alexander Korotkov <aekorotkov@gmail.com> wrote: > > 5) 32-bit limitation within the page mentioned upthread by Stephen > > Frost should be also carefully considered. Ideally, we should get rid > > of it, but I don't have particular ideas in this field for now. At > > least, we should make sure we did our best at error reporting and > > monitoring capabilities. > > I don't think I understand the thinking here. As long as we retain the > existing limit that the oldest running XID can't be more than 2 > billion XIDs in the past, we can't ever need to throw an error. A new > page modification that finds very old XIDs on the page can always > escape trouble by pruning the page and freezing whatever old XIDs > survive pruning. So we'll just fail such an old transaction? Or force a server restart? or..? What if we try to signal that transaction and it doesn't go away? > I would argue that it's smarter not to change the in-memory > representation of XIDs to 64-bit in places like the ProcArray. As you > mention in (4), that might hurt performance. But also, the benefit is > minimal. Nobody is really sad that they can't keep transactions open > forever. They are sad because the system has severe bloat and/or shuts > down entirely. Some kind of change along these lines can fix the > second of those problems, and that's progress. I brought up the concern that I did because I would be a bit sad if I couldn't have a transaction open for a day on a very high rate system of the type being discussed here. Would be fantastic if we had a solution to that issue, but I get that reducing the need to vacuum and such would be a really nice improvement even if we can't make long running transactions work. Then again, if we do actually change the in-memory bits- then maybe we could have such a long running transaction, provided it didn't try to make an update to a page with really old xids on it, which might be entirely reasonable in a lot of cases. I do still worry about how we explain what the limitation here is and how to avoid hitting it. Definitely seems like a 'gotcha' that people are going to complain about, though hopefully not as much of one as the current cases we hear about of vacuum falling behind and the system running out of xids. > > I think the realistic goal for PG 15 development cycle would be > > agreement on a roadmap for all the items above (and probably some > > initial implementations). > > +1. Trying to rush something through to commit is just going to result > in a bunch of bugs. We need to work through the issues carefully and > take the time to do it well. +1. Thanks, Stephen
Attachment
I'd be
curious to know where we found the bits for that -- the tuple header
isn't exactly replete with extra bit space.
Perhaps we can merge some of the code cleanup that it contained, such as using XID_FMT everywhere and creating a type for the kind of page returned by TransactionIdToPage() to make the code cleaner.
Is your patch functionally the same as the PostgresPro implementation?
Is there any documentation or README explaining this whole 64-bit XID mechanism?
Could you tell me what happens if new tuple with XID larger than xid_base + 0xFFFFFFFF is inserted into the page? Such new tuple is not allowed to be inserted into that page? Or xid_base and xids of all existing tuples in the page are increased? Also what happens if one of those xids (of existing tuples) cannot be changed because the tuple still can be seen by very-long-running transaction?
On Sat, 8 Jan 2022 at 08:21, Maxim Orlov <orlovmg@gmail.com> wrote: >> >> Perhaps we can merge some of the code cleanup that it contained, such as using XID_FMT everywhere and creating a type for the kind of page returned by TransactionIdToPage() to make the code cleaner. > > > Agree, I think this is a good idea.

Looks to me like the best next actions would be:

1. Submit a patch that uses XID_FMT everywhere, as a cosmetic change. This looks like it will reduce the main patch size considerably and make it much less scary. That can be cleaned up and committed while we discuss the main approach.

2. Write up the approach in a detailed README, so people can understand the proposal and assess if there are problems. A few short notes and a link back to old conversations isn't enough to allow wide review and give confidence on such a major patch.

-- Simon Riggs http://www.EnterpriseDB.com/
Re: patch that uses XID_FMT everywhere ... to make the main patch much smaller

That's exactly what my previous patch did, plus the patch to support 64-bit GUCs.

Maxim, maybe it's still a good idea to isolate those two patches and submit them separately first, to reduce the size of the rest of the patch?

On 1/12/22, 8:28 AM, "Simon Riggs" <simon.riggs@enterprisedb.com> wrote:

On Sat, 8 Jan 2022 at 08:21, Maxim Orlov <orlovmg@gmail.com> wrote: >> >> Perhaps we can merge some of the code cleanup that it contained, such as using XID_FMT everywhere and creating a type for the kind of page returned by TransactionIdToPage() to make the code cleaner. > > > Agree, I think this is a good idea.

Looks to me like the best next actions would be:

1. Submit a patch that uses XID_FMT everywhere, as a cosmetic change. This looks like it will reduce the main patch size considerably and make it much less scary. That can be cleaned up and committed while we discuss the main approach.

2. Write up the approach in a detailed README, so people can understand the proposal and assess if there are problems. A few short notes and a link back to old conversations isn't enough to allow wide review and give confidence on such a major patch.

-- Simon Riggs http://www.EnterpriseDB.com/
Maxim, maybe it's still a good idea to isolate those two patches and submit them separately first, to reduce the size of the rest of the patch?
Looks to me like the best next actions would be:
1. Submit a patch that uses XID_FMT everywhere, as a cosmetic change.
This looks like it will reduce the main patch size considerably and
make it much less scary. That can be cleaned up and committed while we
discuss the main approach.
2. Write up the approach in a detailed README, so people can
understand the proposal and assess if there are problems. A few short
notes and a link back to old conversations isn't enough to allow wide
review and give confidence on such a major patch.
We intend to do the following work on the patch soon:

1. Write a detailed README.

2. Split the patch into several pieces, including a separate part for XID_FMT. But if committers meanwhile choose to commit Jim's XID_FMT patch, we also appreciate this and will rebase our patch accordingly.
2A. Probably refactor it to store precalculated XMIN/XMAX in the in-memory tuple representation instead of t_xid_base/t_multi_base.
2B. Split the in-memory part of the patch into a separate patch.

3. Construct some variants for leaving the "double xmax" format as a temporary one just after upgrade, so that there is only one persistent on-disk format instead of two:
3A. By using an SQL function "vacuum doublexmax;"
OR
3B. By freeing space on all heap pages for pd_special before pg_upgrade.
OR
3C. By automatically repacking all "double xmax" pages after upgrade (with a priority specified by common vacuum-related GUCs).

4. Intentionally prohibit starting a new transaction with an XID difference of more than 2^32 from the oldest currently running one (roughly sketched below). This is meant to prompt the DBA to deal with defunct transactions, but not to force it: they can wait if they consider these old transactions not defunct.

5. Investigate and add a solution for archs without 64-bit atomic values.
5A. Provide XID 8-byte alignment for systems where 64-bit atomics are provided for 8-byte aligned values.
5B. Wrap XID reading into PG atomic locks for the remaining 32-bit ones (they are expected to be rare).
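For item 4, the kind of guard we have in mind is roughly the following (a sketch only; variable names and error wording are illustrative, and it would live somewhere like GetNewTransactionId()):

if (nextXid - oldestRunningXid > (TransactionId) PG_UINT32_MAX)
    ereport(ERROR,
            (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
             errmsg("cannot assign a new transaction ID"),
             errdetail("The oldest running transaction is more than 2^32 transactions behind."),
             errhint("Commit, roll back, or terminate the oldest running transaction.")));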
Attachment
Hi, On Fri, Jan 14, 2022 at 11:38:46PM +0400, Pavel Borisov wrote: > > PFA patch with README for 64xid proposal. It is 0003 patch of the same v6, > that was proposed earlier [1]. > As always, I very much appreciate your ideas on this readme patch, on > overall 64xid patch [1], and on the roadmap on its improvement quoted above. Thanks for adding this documentation! Unfortunately, the cfbot can't apply a patchset split in multiple emails, so for now you only get coverage for this new readme file, as this is what's being tested on the CI: https://github.com/postgresql-cfbot/postgresql/commit/f8f12ce29344bc7c72665c334b5eb40cee22becd Could you send the full patchset each time you make a modification? For now I'm simply attaching 0001, 0002 and 0003 to make sure that the cfbot will pick all current patches on its next run.
Attachment
I tried to pg_upgrade from a v13 instance like:

time make check -C src/bin/pg_upgrade oldsrc=`pwd`/13 oldbindir=`pwd`/13/tmp_install/usr/local/pgsql/bin

I had compiled and installed v13 into `pwd`/13. First, test.sh failed, because of an option in initdb which doesn't exist in the old version: -x 21000000000. I patched test.sh so the option is used only for the "new" version.

The tab_core_types table has an XID column, so pg_upgrade --check complains and refuses to run. If I drop it, then pg_upgrade runs, but then fails like this:

|Files /home/pryzbyj/src/postgres/src/bin/pg_upgrade/tmp_check/new_xids.txt and /home/pryzbyj/src/postgres/src/bin/pg_upgrade/tmp_check/old_xids.txt differ
|See /home/pryzbyj/src/postgres/src/bin/pg_upgrade/tmp_check/xids.diff
|
|--- /home/pryzbyj/src/postgres/src/bin/pg_upgrade/tmp_check/new_xids.txt 2022-01-15 00:14:23.035294414 -0600
|+++ /home/pryzbyj/src/postgres/src/bin/pg_upgrade/tmp_check/old_xids.txt 2022-01-15 00:13:59.634945012 -0600
|@@ -1,5 +1,5 @@
| relfrozenxid | relminmxid
| --------------+------------
|- 3 | 3
|+ 15594 | 3
| (1 row)

Also, the patch needs to be rebased over Peter's vacuum changes.

Here are the changes I used for my test:

diff --git a/src/bin/pg_upgrade/test.sh b/src/bin/pg_upgrade/test.sh
index c6361e3c085..5eae42192b6 100644
--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -24,7 +24,8 @@ standard_initdb() {
 	# without increasing test runtime, run these tests with a custom setting.
 	# Also, specify "-A trust" explicitly to suppress initdb's warning.
 	# --allow-group-access and --wal-segsize have been added in v11.
-	"$1" -N --wal-segsize 1 --allow-group-access -A trust -x 21000000000
+	"$@" -N --wal-segsize 1 --allow-group-access -A trust
+
 	if [ -n "$TEMP_CONFIG" -a -r "$TEMP_CONFIG" ]
 	then
 		cat "$TEMP_CONFIG" >> "$PGDATA/postgresql.conf"
@@ -237,7 +238,7 @@
 fi
 
 PGDATA="$BASE_PGDATA"
 
-standard_initdb 'initdb'
+standard_initdb 'initdb' -x 21000000000
 
 pg_upgrade $PG_UPGRADE_OPTS --no-sync -d "${PGDATA}.old" -D "$PGDATA" -b "$oldbindir" -p "$PGPORT" -P "$PGPORT"

diff --git a/src/bin/pg_upgrade/upgrade_adapt.sql b/src/bin/pg_upgrade/upgrade_adapt.sql
index 27c4c7fd011..c5ce8bc95b2 100644
--- a/src/bin/pg_upgrade/upgrade_adapt.sql
+++ b/src/bin/pg_upgrade/upgrade_adapt.sql
@@ -89,3 +89,5 @@ DROP OPERATOR public.#%# (pg_catalog.int8, NONE);
 DROP OPERATOR public.!=- (pg_catalog.int8, NONE);
 DROP OPERATOR public.#@%# (pg_catalog.int8, NONE);
 \endif
+
+DROP TABLE IF EXISTS tab_core_types;
On Wed, Jan 05, 2022 at 06:51:37PM -0500, Bruce Momjian wrote: > On Tue, Jan 4, 2022 at 10:22:50PM +0000, Finnerty, Jim wrote: [skipped] > > with the "double-xmax" representation. This would eliminate a whole > > class of coding errors and would make the code dealing with 64-bit > > XIDs simpler and more maintainable. > > Well, yes, we could do this, and it would avoid the complexity of having > to support two XID representations, but we would need to accept that > fast pg_upgrade would be impossible in such cases, since every page > would need to be checked and potentially updated. > > You might try to do this while the server is first started and running > queries, but I think we found out from the online checkpoint patch that > having the server in an intermediate state while running queries is very > complex --- it might be simpler to just accept two XID formats all the > time than enabling the server to run with two formats for a short > period. My big point is that this needs more thought.

Probably, some table storage housekeeping would be wanted. Like a column in pg_class describing the current set of options of the table: checksums added, 64-bit xids added, type of 64-bit xids (probably some would want to add support for the pgpro upgrades), some set of defaults to not include a lot of them in all page headers -- like compressed xid/integer formats or extended page size. And separate tables that describe the transition state -- like when adding checksums, the desired state for the relation (checksums), and a set of ranges in the table files that are already transitioned/checked.

That probably will not introduce too much slowdown at least on reading, and will add the transition/upgrade mechanics. Weren't there already some discussions about such a feature in the mailing lists?

> > > -- > Bruce Momjian <bruce@momjian.us> https://momjian.us > EDB https://enterprisedb.com > > If only the physical world exists, free will is an illusion. > >
Attachment
Hi Pavel,

> Please feel free to discuss readme and your opinions on the current patch and proposed changes [1].

Just a quick question about this design choice:

> On-disk tuple format remains unchanged. 32-bit t_xmin and t_xmax store the
> lower parts of 64-bit XMIN and XMAX values. Each heap page has additional
> 64-bit pd_xid_base and pd_multi_base which are common for all tuples on a page.
> They are placed into a pd_special area - 16 bytes in the end of a heap page.
> Actual XMIN/XMAX for a tuple are calculated upon reading a tuple from a page
> as follows:
>
> XMIN = t_xmin + pd_xid_base.
> XMAX = t_xmax + pd_xid_base/pd_multi_base.

Did you consider using 4 bytes for pd_xid_base and another 4 bytes for (pd_xid_base/pd_multi_base)? This would allow calculating XMIN/XMAX as:

XMIN = (t_min_extra_bits << 32) | t_xmin
XMAX = (t_max_extra_bits << 32) | t_xmax

... and save 8 extra bytes in the pd_special area. Or maybe I'm missing some context here?

--
Best regards,
Aleksander Alekseev
Did you consider using 4 bytes for pd_xid_base and another 4 bytes for
(pd_xid_base/pd_multi_base)? This would allow calculating XMIN/XMAX
as:
XMIN = (t_min_extra_bits << 32) | t_xmin
XMAX = (t_max_extra_bits << 32) | t_xmax
... and save 8 extra bytes in the pd_special area. Or maybe I'm
missing some context here?
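For readers following the arithmetic, here is a minimal C sketch of the page-base scheme under discussion, using the names from the quoted README; this is an illustration, not code from the patch, and the handling of special XIDs is only noted in a comment.

```c
#include <stdint.h>

typedef uint64_t TransactionId;        /* 64-bit XID, per the proposal */
typedef uint32_t ShortTransactionId;   /* 32-bit on-disk t_xmin/t_xmax */

/*
 * Sketch of "XMIN = t_xmin + pd_xid_base": the on-disk 32-bit value is an
 * offset from the per-page base stored in the special area. Special XIDs
 * (Invalid/Bootstrap/Frozen) would be passed through unchanged in the real
 * code; that handling is omitted here.
 */
static inline TransactionId
heap_tuple_get_xmin(TransactionId pd_xid_base, ShortTransactionId t_xmin)
{
    return pd_xid_base + (TransactionId) t_xmin;
}
```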
Hi,

On 2022-01-24 16:38:54 +0400, Pavel Borisov wrote:
> +64-bit Transaction ID's (XID)
> +=============================
> +
> +A limited number (N = 2^32) of XID's required to do vacuum freeze to prevent
> +wraparound every N/2 transactions. This causes performance degradation due
> +to the need to exclusively lock tables while being vacuumed. In each
> +wraparound cycle, SLRU buffers are also being cut.

What exclusive lock?

> +"Double XMAX" page format
> +---------------------------------
> +
> +At first read of a heap page after pg_upgrade from 32-bit XID PostgreSQL
> +version pd_special area with a size of 16 bytes should be added to a page.
> +Though a page may not have space for this. Then it can be converted to a
> +temporary format called "double XMAX".
>
> +All tuples after pg-upgrade would necessarily have xmin = FrozenTransactionId.

Why would a tuple after pg-upgrade necessarily have xmin = FrozenTransactionId? A pg_upgrade doesn't scan the tables, so the pg_upgrade itself doesn't do anything to xmins. I guess you mean that the xmin cannot be needed anymore, because no older transaction can be running?

> +In-memory tuple format
> +----------------------
> +
> +In-memory tuple representation consists of two parts:
> +- HeapTupleHeader from disk page (contains all heap tuple contents, not only
> +header)
> +- HeapTuple with additional in-memory fields
> +
> +HeapTuple for each tuple in memory stores t_xid_base/t_multi_base - a copies of
> +page's pd_xid_base/pd_multi_base. With tuple's 32-bit t_xmin and t_xmax from
> +HeapTupleHeader they are used to calculate actual 64-bit XMIN and XMAX:
> +
> +XMIN = t_xmin + t_xid_base. (3)
> +XMAX = t_xmax + t_xid_base/t_multi_base. (4)

What identifies a HeapTuple as having this additional data?

> +The downside of this is that we can not use tuple's XMIN and XMAX right away.
> +We often need to re-read t_xmin and t_xmax - which could actually be pointers
> +into a page in shared buffers and therefore they could be updated by any other
> +backend.

Ugh, that's not great.

> +Upgrade from 32-bit XID versions
> +--------------------------------
> +
> +pg_upgrade doesn't change pages format itself. It is done lazily after.
> +
> +1. At first heap page read, tuples on a page are repacked to free 16 bytes
> +at the end of a page, possibly freeing space from dead tuples.

That will cause a *massive* torrent of writes after an upgrade. Isn't this practically making pg_upgrade useless? Imagine a huge cluster where most of the pages are all-frozen, upgraded using link mode.

What happens if the first access happens on a replica?

What is the approach for dealing with multixact files? They have xids embedded? And currently the SLRUs will break if you just let the offsets SLRU grow without bounds.
> +void
> +convert_page(Relation rel, Page page, Buffer buf, BlockNumber blkno)
> +{
> +	PageHeader	hdr = (PageHeader) page;
> +	GenericXLogState *state = NULL;
> +	Page		tmp_page = page;
> +	uint16		checksum;
> +
> +	if (!rel)
> +		return;
> +
> +	/* Verify checksum */
> +	if (hdr->pd_checksum)
> +	{
> +		checksum = pg_checksum_page((char *) page, blkno);
> +		if (checksum != hdr->pd_checksum)
> +			ereport(ERROR,
> +					(errcode(ERRCODE_INDEX_CORRUPTED),
> +					 errmsg("page verification failed, calculated checksum %u but expected %u",
> +							checksum, hdr->pd_checksum)));
> +	}
> +
> +	/* Start xlog record */
> +	if (!XactReadOnly && XLogIsNeeded() && RelationNeedsWAL(rel))
> +	{
> +		state = GenericXLogStart(rel);
> +		tmp_page = GenericXLogRegisterBuffer(state, buf, GENERIC_XLOG_FULL_IMAGE);
> +	}
> +
> +	PageSetPageSizeAndVersion((hdr), PageGetPageSize(hdr),
> +							  PG_PAGE_LAYOUT_VERSION);
> +
> +	if (was_32bit_xid(hdr))
> +	{
> +		switch (rel->rd_rel->relkind)
> +		{
> +			case 'r':
> +			case 'p':
> +			case 't':
> +			case 'm':
> +				convert_heap(rel, tmp_page, buf, blkno);
> +				break;
> +			case 'i':
> +				/* no need to convert index */
> +			case 'S':
> +				/* no real need to convert sequences */
> +				break;
> +			default:
> +				elog(ERROR,
> +					 "Conversion for relkind '%c' is not implemented",
> +					 rel->rd_rel->relkind);
> +		}
> +	}
> +
> +	/*
> +	 * Mark buffer dirty unless this is a read-only transaction (e.g. query
> +	 * is running on hot standby instance)
> +	 */
> +	if (!XactReadOnly)
> +	{
> +		/* Finish xlog record */
> +		if (XLogIsNeeded() && RelationNeedsWAL(rel))
> +		{
> +			Assert(state != NULL);
> +			GenericXLogFinish(state);
> +		}
> +
> +		MarkBufferDirty(buf);
> +	}
> +
> +	hdr = (PageHeader) page;
> +	hdr->pd_checksum = pg_checksum_page((char *) page, blkno);
> +}

Wait. So you just modify the page without WAL logging or marking it dirty on a standby? I fail to see how that can be correct. Imagine the cluster is promoted, the page is dirtied, and we write it out. You'll have written out a completely changed page, without any WAL logging.

There's plenty other scenarios.

Greetings,

Andres Freund
> +The downside of this is that we can not use tuple's XMIN and XMAX right away.
> +We often need to re-read t_xmin and t_xmax - which could actually be pointers
> +into a page in shared buffers and therefore they could be updated by any other
> +backend.
Ugh, that's not great.
What happens if the first access happens on a replica?
What is the approach for dealing with multixact files? They have xids
embedded? And currently the SLRUs will break if you just let the offsets SLRU
grow without bounds.
Wait. So you just modify the page without WAL logging or marking it dirty on a
standby? I fail to see how that can be correct.
Imagine the cluster is promoted, the page is dirtied, and we write it
out. You'll have written out a completely changed page, without any WAL
logging. There's plenty other scenarios.
Attachment
Hi hackers,

> In this part, I suppose you've found a definite bug. Thanks! There are a couple
> of ways how it could be fixed:
>
> 1. If we enforce checkpoint at replica promotion then we force full-page writes after each page modification afterward.
>
> 2. Maybe it's worth using BufferDesc bit to mark the page as converted to 64xid but not yet written to disk? For example, one of four bits from BUF_USAGECOUNT.
> BM_MAX_USAGE_COUNT = 5 so it will be enough 3 bits to store it. This will change in-memory page representation but will not need WAL-logging which is impossible on a replica.
>
> What do you think about it?

I'm having difficulties merging and/or testing v8-0002-Add-64bit-xid.patch since I'm not 100% sure which commit this patch was targeting. Could you please submit a rebased patch and/or share your development branch on GitHub?

I agree with Bruce it would be great to deliver this in PG15. Please let me know if you believe it's unrealistic for any reason so I will focus on testing and reviewing other patches.

For now, I'm changing the status of the patch to "Waiting on Author".

--
Best regards,
Aleksander Alekseev
Hi! Here is the rebased version.
On Wed, Mar 02, 2022 at 06:43:11PM +0400, Pavel Borisov wrote:
> Hi hackers!
>
> Hi! Here is the rebased version.

The patch doesn't apply - I suppose the patch is relative to a forked postgres which already has other patches.
http://cfbot.cputube.org/pavel-borisov.html

Note also that I mentioned an issue with pg_upgrade. Handling that well is probably the most important part of the patch.

--
Justin
Hi,

On 2022-03-02 16:25:33 +0300, Aleksander Alekseev wrote:
> I agree with Bruce it would be great to deliver this in PG15.
> Please let me know if you believe it's unrealistic for any reason so I will
> focus on testing and reviewing other patches.

I don't see 15 as a realistic target for this patch. There's a huge amount of work left, and it has gotten very little review.

I encourage trying to break down the patch into smaller incrementally useful pieces. E.g. making all the SLRUs 64bit would be a substantial and independently committable piece.

Greetings,

Andres Freund
Hi hackers,

> The patch doesn't apply - I suppose the patch is relative a forked postgres

No, the authors just used a little outdated `master` branch. I successfully applied it against 31d8d474 and then rebased to the latest master (62ce0c75). The new version is attached.

Not 100% sure if my rebase is correct since I didn't invest too much time into reviewing the code. But at least it passes `make installcheck` locally. Let's see what cfbot will tell us.

> I encourage trying to break down the patch into smaller incrementally useful
> pieces. E.g. making all the SLRUs 64bit would be a substantial and
> independently committable piece.

Completely agree. And the changes like:

+#if 0 /* XXX remove unit tests */

... suggest that the patch is pretty raw in its current state.

Pavel, Maxim, don't you mind me splitting the patchset, or would you like to do it yourself and/or maybe include more changes? I don't know how actively you are working on this.

--
Best regards,
Aleksander Alekseev
Attachment
> The patch doesn't apply - I suppose the patch is relative a forked postgres
No, the authors just used a little outdated `master` branch. I
successfully applied it against 31d8d474 and then rebased to the
latest master (62ce0c75). The new version is attached.
Not 100% sure if my rebase is correct since I didn't invest too much
time into reviewing the code. But at least it passes `make
installcheck` locally. Let's see what cfbot will tell us.
> I encourage trying to break down the patch into smaller incrementally useful
> pieces. E.g. making all the SLRUs 64bit would be a substantial and
> independently committable piece.
Completely agree. And the changes like:
+#if 0 /* XXX remove unit tests */
... suggest that the patch is pretty raw in its current state.
Pavel, Maxim, don't you mind me splitting the patchset, or would you
like to do it yourself and/or maybe include more changes? I don't know
how actively you are working on this.
Attachment
Hi Pavel!

On Thu, Mar 3, 2022 at 2:35 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:
> BTW messages with patches in this thread are always invoke manual spam moderation and we need to wait for ~3 hours before the message with patch becomes visible in the hackers thread. Now when I've already answered Alexander's letter with v10 patch the very message (and a patch) I've answered is still not visible in the thread and to CFbot.
>
> Can something be done in hackers' moderation engine to make new versions patches become visible hassle-free?

Is your email address subscribed to the pgsql-hackers mailing list? AFAIK, moderation is only applied for non-subscribers.

------
Regards,
Alexander Korotkov
> BTW messages with patches in this thread are always invoke manual spam moderation and we need to wait for ~3 hours before the message with patch becomes visible in the hackers thread. Now when I've already answered Alexander's letter with v10 patch the very message (and a patch) I've answered is still not visible in the thread and to CFbot.
>
> Can something be done in hackers' moderation engine to make new versions patches become visible hassle-free?
Is your email address subscribed to the pgsql-hackers mailing list?
AFAIK, moderation is only applied for non-subscribers.
Greetings,

* Pavel Borisov (pashkin.elfe@gmail.com) wrote:
> > > BTW messages with patches in this thread are always invoke manual spam
> > > moderation and we need to wait for ~3 hours before the message with patch
> > > becomes visible in the hackers thread. Now when I've already answered
> > > Alexander's letter with v10 patch the very message (and a patch) I've
> > > answered is still not visible in the thread and to CFbot.
> > >
> > > Can something be done in hackers' moderation engine to make new versions
> > > patches become visible hassle-free?
> >
> > Is your email address subscribed to the pgsql-hackers mailing list?
> > AFAIK, moderation is only applied for non-subscribers.
>
> Yes, it is in the list. The problem is that patch is over 1Mb. So it
> strictly goes through moderation. And this is unchanged for 2 months
> already.

Right, >1MB will be moderated, as will emails that are CC'd to multiple lists, and somehow this email thread ended up with two different addresses for -hackers, which isn't good.

> I was advised to use .gz, which I will do next time.

Better would be to break the patch down into reasonable and independent pieces for review and commit on separate threads as suggested previously, and not to send huge patches to the list with the idea that someone is going to actually fully review and commit them. That's just not likely to end up working well anyway.

Thanks,

Stephen
Attachment
Hi hackers,

> We've rebased patchset onto the current master. The result is almost the
> same as Alexander's v10 (it is a shame it is still in moderation and not
> visible in the thread). Anyway, this is the v11 patch. Reviews are very
> welcome.

Here is a rebased and slightly modified version of the patch.

I extracted the introduction of the XID_FMT macro to a separate patch. Also, I noticed that sometimes PRIu64 was used to format XIDs instead. I changed it to XID_FMT for consistency. v12-0003 can be safely delivered in PG15.

> I encourage trying to break down the patch into smaller incrementally useful
> pieces. E.g. making all the SLRUs 64bit would be a substantial and
> independently committable piece.

I'm going to address this in follow-up emails.

--
Best regards,
Aleksander Alekseev
Attachment
Hi hackers,

> Here is a rebased and slightly modified version of the patch.
>
> I extracted the introduction of XID_FMT macro to a separate patch. Also,
> I noticed that sometimes PRIu64 was used to format XIDs instead. I changed it
> to XID_FMT for consistency. v12-0003 can be safely delivered in PG15.
>
> > I encourage trying to break down the patch into smaller incrementally useful
> > pieces. E.g. making all the SLRUs 64bit would be a substantial and
> > independently committable piece.
>
> I'm going to address this in follow-up emails.

cfbot is not happy because several files are missing in v12. Here is a corrected and rebased version. I also removed the "#undef PRIu64" change from include/c.h since previously I replaced PRIu64 usage with XID_FMT.

--
Best regards,
Aleksander Alekseev
Attachment
Hi hackers,

> I extracted the introduction of XID_FMT macro to a separate patch. Also,
> I noticed that sometimes PRIu64 was used to format XIDs instead. I changed it
> to XID_FMT for consistency. v12-0003 can be safely delivered in PG15.

[...]

> > > I encourage trying to break down the patch into smaller incrementally useful
> > > pieces. E.g. making all the SLRUs 64bit would be a substantial and
> > > independently committable piece.
> >
> > I'm going to address this in follow-up emails.
>
> cfbot is not happy because several files are missing in v12. Here is a
> corrected and rebased version. I also removed the "#undef PRIu64"
> change from include/c.h since previously I replaced PRIu64 usage with
> XID_FMT.

Here is a new version of the patchset. SLRU refactoring was moved to a separate patch. Both v14-0003 (XID_FMT macro) and v14-0004 (SLRU refactoring) can be delivered in PG15.

One thing I couldn't understand so far is why SLRU_PAGES_PER_SEGMENT should necessarily be increased in order to make 64-bit XIDs work. I kept the current value (32) in v14-0004 but changed it to 2048 in v14-0005 (where we start using 64-bit XIDs) as it was in the original patch. Is this change really required?

--
Best regards,
Aleksander Alekseev
Attachment
Hi hackers,

> Here is a new version of the patchset. SLRU refactoring was moved to a
> separate patch. Both v14-0003 (XID_FMT macro) and v14-0004 (SLRU
> refactoring) can be delivered in PG15.

Here is a new version of the patchset. The changes compared to v14 are minimal. Most importantly, the GCC warning reported by cfbot was (hopefully) fixed. The patch order was also altered, v15-0001 and v15-0002 are targeting PG15 now, the rest are targeting PG16.

Also for the record, I tested the patchset on Raspberry Pi 3 Model B+ in the hope that it will discover some new flaws. To my disappointment, it didn't.

--
Best regards,
Aleksander Alekseev
Attachment
Hi hackers,

> > Here is a new version of the patchset. SLRU refactoring was moved to a
> > separate patch. Both v14-0003 (XID_FMT macro) and v14-0004 (SLRU
> > refactoring) can be delivered in PG15.
>
> Here is a new version of the patchset. The changes compared to v14 are
> minimal. Most importantly, the GCC warning reported by cfbot was
> (hopefully) fixed. The patch order was also altered, v15-0001 and
> v15-0002 are targeting PG15 now, the rest are targeting PG16.
>
> Also for the record, I tested the patchset on Raspberry Pi 3 Model B+
> in the hope that it will discover some new flaws. To my
> disappointment, it didn't.

Here is the rebased version of the patchset. Also, I updated the commit messages for v16-0001 and v16-0002 to make them look more like the rest of the PostgreSQL commit messages. They include the link to this discussion now as well.

IMO v16-0001 and v16-0002 are in pretty good shape and are as much as we are going to deliver in PG15. I'm going to change the status of the CF entry to "Ready for Committer" somewhere this week unless someone believes v16-0001 and/or v16-0002 shouldn't be merged.

--
Best regards,
Aleksander Alekseev
Hi hackers,

> IMO v16-0001 and v16-0002 are in pretty good shape and are as much as
> we are going to deliver in PG15. I'm going to change the status of the
> CF entry to "Ready for Committer" somewhere this week unless someone
> believes v16-0001 and/or v16-0002 shouldn't be merged.

Sorry for the missing attachment. Here it is.

--
Best regards,
Aleksander Alekseev
Attachment
Attachment
Hi! Here is updated version of the patch, based on Alexander's ver16.
Hi, Hackers!

> Hi! Here is updated version of the patch, based on Alexander's ver16.

I'd like to add a few quick notes on what's been done in v17.

Patches 0001 and 0002 that are planned to be committed to PG15 are almost unchanged, with the exception of one unnecessary cast in 0002 removed.

We've also addressed several issues in patch 0005 (which is planned for PG16):
- The bug with frozen xids after pg_upgrade, reported by Justin [1]
- Added proper processing of double xmax pages in HeapPageSetPruneXidInternal()
- Fixed xids comparison. Initially in the patch it was changed to simple < <= => > for 64-bit values. Now the v17 patch has returned this to the way similar to what is used in STABLE for 32-bit xids, but using a modulus-64 numeric ring. The main goal of this change was to fix the SLRU tests that were mentioned by Alexander to have been disabled. We've fixed and enabled most of them, but some of them still need to be fixed and enabled.

Also, we've pgindent-ed all the patches.

As patches that are planned to be delivered to PG15 are almost unchanged, I completely agree with Alexander's plan to consider these patches (0001 and 0002) as RfC.

All activity, improvement, review, etc. related to the whole patchset is also very much appreciated. Big thanks to Alexander for working on the patch set!

[1] https://www.postgresql.org/message-id/20220115063925.GS14051%40telsasoft.com
At Mon, 14 Mar 2022 19:43:40 +0400, Pavel Borisov <pashkin.elfe@gmail.com> wrote in
> > I'd like to add a few quick notes on what's been done in v17.

I have some comments from a quick look-through. Apologies in advance for wrong comments from my lack of knowledge of the whole patch-set.

> > Patches 0001 and 0002 that are planned to be committed to PG15 are almost
> > unchanged with the exception of one unnecessary cast in 0002 removed.

0001:

The XID_FMT has quite a bad impact on the translatability of error messages. 3286065651 has removed INT64_FORMAT from translatable texts for that reason. This re-introduces that in several places. 0001 itself does not harm but 0005 replaces XID_FMT with INT64_FORMAT. Other patches have the same issue, too.

> > We've also addressed several issues in patch 0005 (which is planned for
> > PG16):
> > - The bug with frozen xids after pg_upgrade, reported by Justin [1]
> > - Added proper processing of double xmax pages in
> > HeapPageSetPruneXidInternal()
> > - Fixed xids comparison. Initially in the patch it was changed to simple <
> > <= => > for 64 bit values. Now v17 patch has returned this to the way
> > similar to what is used in STABLE for 32-bit xids, but using modulus-64
> > numeric ring. The main goal of this change was to fix SRLU tests that were mentioned

If IIUC, the following part in 0002 doesn't consider wraparound.

-asyncQueuePagePrecedes(int p, int q)
+asyncQueuePagePrecedes(int64 p, int64 q)
 {
-	return asyncQueuePageDiff(p, q) < 0;
+	return p < q;
 }

> > by Alexander to have been disabled. We've fixed and enabled most of them,
> > but some of them are still need to be fixed and enabled.
> >
> > Also, we've pgindent-ed all the patches.

0005 has "new blank line at EOF".

> > As patches that are planned to be delivered to PG15 are almost unchanged,
> > I completely agree with Alexander's plan to consider these patches (0001
> > and 0002) as RfC.
> >
> > All activity, improvement, review, etc. related to the whole patchset is
> > also very much appreciated. Big thanks to Alexander for working on the
> > patch set!
> >
> > [1]
> > https://www.postgresql.org/message-id/20220115063925.GS14051%40telsasoft.com

> Also, the patch v17 (0005) returns SLRU_PAGES_PER_SEGMENT to the previous
> value of 32.

0002 re-introduces pg_strtouint64, which has been recently removed by 3c6f8c011f.

> Simplify the general-purpose 64-bit integer parsing APIs
>
> pg_strtouint64() is a wrapper around strtoull/strtoul/_strtoui64, but
> it seems no longer necessary to have this indirection.
..
> type definition int64/uint64. For that, add new macros strtoi64() and
> strtou64() in c.h as thin wrappers around strtol()/strtoul() or
> strtoll()/stroull(). This makes these functions available everywhere

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
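As a point of reference, the usual way to keep such comparisons wraparound-safe on a circular numbering is the signed-difference idiom (the "modulus" ring Pavel mentions for v17). A generic sketch, not code from the patchset:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch: "a precedes b" on a circular 64-bit numbering, analogous to
 * what TransactionIdPrecedes() does for 32-bit XIDs. The subtraction
 * wraps modulo 2^64, so the sign of the difference gives the ordering
 * even across a wraparound of the counter.
 */
static inline bool
precedes_circular64(uint64_t a, uint64_t b)
{
    return (int64_t) (a - b) < 0;
}
```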
0001:
The XID_FMT has quite bad impact on the translatability of error
messages. 3286065651 has removed INT64_FORMAT from translatable
texts for the reason. This re-introduces that in several places.
0001 itself does not harm but 0005 replaces XID_FMT with
INT64_FORMAT. Other patches have the same issue, too.
Attachment
Hi,

On Mon, Mar 14, 2022 at 01:32:04PM +0300, Aleksander Alekseev wrote:
> IMO v16-0001 and v16-0002 are in pretty good shape and are as much as
> we are going to deliver in PG15. I'm going to change the status of the
> CF entry to "Ready for Committer" somewhere this week unless someone
> believes v16-0001 and/or v16-0002 shouldn't be merged.

Not sure, but if you want more people to look at them, probably best would be to start a new thread with just the v15-target patches. Right now, one has to download your tarball, extract it and look at the patches in there. I hope v16-0001 and v16-0002 are small enough (I didn't do the above) that they can just be attached normally?


Michael

--
Michael Banck
Team Lead PostgreSQL
Project Manager
Tel.: +49 2166 9901-171
Mail: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Management: Dr. Michael Meskes, Geoff Richardson, Peter Lilley

Our handling of personal data is subject to:
https://www.credativ.de/en/contact/privacy/
At Tue, 15 Mar 2022 18:48:34 +0300, Maxim Orlov <orlovmg@gmail.com> wrote in
> Hi Kyotaro!
>
> > 0001:
> >
> > The XID_FMT has quite bad impact on the translatability of error
> > messages. 3286065651 has removed INT64_FORMAT from translatable
> > texts for the reason. This re-introduces that in several places.
> > 0001 itself does not harm but 0005 replaces XID_FMT with
> > INT64_FORMAT. Other patches have the same issue, too.
>
> I do understand your concern and I wonder how I can do this better? My
> first intention was to replace XID_FMT with %llu and INT64_FORMAT with
> %lld. This should solve the translatability issue, but I'm not sure about
> portability of this. Should this work on Windows, etc? Can you advise me on
> the best solution?

Doesn't doing "errmsg("blah blah %lld ..", (long long) xid)" work?

> We've fixed all the other things mentioned. Thanks!
>
> Also added two fixes:
> - CF bot was unhappy with pg_upgrade test in v17 because I forgot to add a
> fix for computation of relminmxid during vacuum on a fresh database.
> - Replace frozen or invalid x_min with FrozenTransactionId or
> InvalidTransactionId respectively during tuple conversion to 64xid.
>
> Reviews are welcome as always! Thanks!

My pleasure.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
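For illustration, the pattern being suggested looks roughly like this (a sketch assuming the usual postgres.h/ereport context; the message text itself is made up). Keeping the width conversion in an explicit cast rather than in the format string leaves the translatable text identical on every platform:

```c
/* Sketch only: no INT64_FORMAT/XID_FMT inside the translatable string. */
ereport(ERROR,
        (errcode(ERRCODE_DATA_CORRUPTED),
         errmsg("unexpected xmax %llu found in tuple",
                (unsigned long long) xmax)));
```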
Attachment
[1]: https://postgr.es/m/CAJ7c6TPDOYBYrnCAeyndkBktO0WG2xSdYduTF0nxq+vfkmTF5Q@mail.gmail.com
--
Best regards,
Aleksander Alekseev
I forked this thread as requested by several people in the discussion [1]. The new thread contains two patches that are targeting PG15. I replaced the thread in the current CF to [1]. This thread was added to the next CF. I suggest we continue discussing changes targeting PG >= 16 here.
[1]: https://postgr.es/m/CAJ7c6TPDOYBYrnCAeyndkBktO0WG2xSdYduTF0nxq+vfkmTF5Q@mail.gmail.com
Attachment
Attachment
Hi,

On 2022-03-18 18:22:00 +0300, Maxim Orlov wrote:
> Here is v22 with following changes:
> - use explicit unsigned long long cast for printf/elog XIDs instead of
> macro XID_TYPE
> - add *.po localization
> - fix forgotten XIDs format changes in pg_resetwal.c
> - 0006 patch refactoring

FWIW, 0006 still does way way too many things at once, even with 0001-0003 split out. I don't really see a point in reviewing / posting new versions until that's done.

Greetings,

Andres Freund
On Fri, Mar 11, 2022 at 08:26:26PM +0300, Aleksander Alekseev wrote:
> Hi hackers,
>
> > Here is a new version of the patchset. SLRU refactoring was moved to a
> > separate patch. Both v14-0003 (XID_FMT macro) and v14-0004 (SLRU
> > refactoring) can be delivered in PG15.
>
> Here is a new version of the patchset. The changes compared to v14 are
> minimal. Most importantly, the GCC warning reported by cfbot was
> (hopefully) fixed. The patch order was also altered, v15-0001 and
> v15-0002 are targeting PG15 now, the rest are targeting PG16.

Do you know that you can test a branch on cirrus without using CF bot or mailing the patch to the list? See src/tools/ci/README

--
Justin
Do you know that you can test a branch on cirrus without using CF bot or
mailing the patch to the list ? See src/tools/ci/README
Attachment
Attachment
- v31-0005-README.XID64.patch
- v31-0004-Use-64-bit-pages-representation-in-SLRU-callers.patch
- v31-0003-Use-64-bit-FullTransactionId-instead-of-Epoch-xi.patch
- v31-0001-Use-64-bit-numbering-of-SLRU-pages.patch
- v31-0002-Use-64-bit-format-to-output-XIDs.patch
- v31-0006-Use-64-bit-GUCs.patch
- v31-0007-Use-64-bit-XIDs.patch
Attachment
- v32-0004-Use-64-bit-pages-representation-in-SLRU-callers.patch
- v32-0003-Use-64-bit-FullTransactionId-instead-of-Epoch-xi.patch
- v32-0001-Use-64-bit-numbering-of-SLRU-pages.patch
- v32-0005-README.XID64.patch
- v32-0002-Use-64-bit-format-to-output-XIDs.patch
- v32-0006-Use-64-bit-GUCs.patch
- v32-0007-Use-64-bit-XIDs.patch
Attachment
- v34-0001-Use-64-bit-numbering-of-SLRU-pages.patch
- v34-0004-Use-64-bit-pages-representation-in-SLRU-callers.patch
- v34-0005-Add-initdb-option-to-initialize-cluster-with-non.patch
- v34-0002-Use-64-bit-format-to-output-XIDs.patch
- v34-0003-Use-64-bit-FullTransactionId-instead-of-Epoch-xi.patch
- v34-0006-README.XID64.patch
- v34-0007-Use-64-bit-GUCs.patch
- v34-0008-Use-64-bit-XIDs.patch
Attachment
- v35-0003-Use-64-bit-FullTransactionId-instead-of-Epoch-xi.patch
- v35-0001-Use-64-bit-numbering-of-SLRU-pages.patch
- v35-0004-Use-64-bit-pages-representation-in-SLRU-callers.patch
- v35-0002-Use-64-bit-format-to-output-XIDs.patch
- v35-0005-Add-initdb-option-to-initialize-cluster-with-non.patch
- v35-0006-README.XID64.patch
- v35-0007-Use-64-bit-GUCs.patch
- v35-0008-Use-64-bit-XIDs.patch
--
Attachment
- v36-0006-README.XID64.patch
- v36-0007-Use-64-bit-GUCs.patch
- v36-0005-Add-initdb-option-to-initialize-cluster-with-non.patch
- v36-0004-Use-64-bit-pages-representation-in-SLRU-callers.patch
- v36-0008-Use-64-bit-XIDs.patch
- v36-0001-Use-64-bit-numbering-of-SLRU-pages.patch
- v36-0002-Use-64-bit-format-to-output-XIDs.patch
- v36-0003-Use-64-bit-FullTransactionId-instead-of-Epoch-xi.patch
Attachment
- v37-0005-Add-initdb-option-to-initialize-cluster-with-non.patch
- v37-0004-Use-64-bit-pages-representation-in-SLRU-callers.patch
- v37-0007-Use-64-bit-GUCs.patch
- v37-0006-README.XID64.patch
- v37-0003-Use-64-bit-FullTransactionId-instead-of-Epoch-xi.patch
- v37-0001-Use-64-bit-numbering-of-SLRU-pages.patch
- v37-0002-Use-64-bit-format-to-output-XIDs.patch
- v37-0008-Use-64-bit-XIDs.patch
Attachment
- v38-0001-Use-64-bit-numbering-of-SLRU-pages.patch
- v38-0007-Use-64-bit-GUCs.patch
- v38-0006-README.XID64.patch
- v38-0005-Add-initdb-option-to-initialize-cluster-with-non.patch
- v38-0004-Use-64-bit-pages-representation-in-SLRU-callers.patch
- v38-0003-Use-64-bit-FullTransactionId-instead-of-Epoch-xi.patch
- v38-0008-Use-64-bit-XIDs.patch
- v38-0002-Use-64-bit-format-to-output-XIDs.patch
This seems to be causing cfbot/cirrusci to time out. Here's the build history:
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/38/3594

https://cirrus-ci.com/task/4809278652416000 4 weeks ago on macos
https://cirrus-ci.com/task/5559884417597440 2 weeks ago on macos
https://cirrus-ci.com/task/6629554545491968 2 weeks ago on macos
https://cirrus-ci.com/task/5253255562264576 this week on freebsd

It seems like there's a larger discussion to be had about the architecture of the patch, but I thought I'd point out this detail.
This seems to be causing cfbot/cirrusci to time out.
Here's the build history
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/38/3594
https://cirrus-ci.com/task/4809278652416000 4 weeks ago on macos
https://cirrus-ci.com/task/5559884417597440 2 weeks ago on macos
https://cirrus-ci.com/task/6629554545491968 2 weeks ago on macos
https://cirrus-ci.com/task/5253255562264576 this week on freebsd
It seems like there's a larger discussion to be had about the architecture
about the patch, but I thought I'd point out this detail.
Attachment
- v39-0001-Use-64-bit-numbering-of-SLRU-pages.patch
- v39-0005-Add-initdb-option-to-initialize-cluster-with-non.patch
- v39-0002-Use-64-bit-format-to-output-XIDs.patch
- v39-0004-Use-64-bit-pages-representation-in-SLRU-callers.patch
- v39-0007-Use-64-bit-GUCs.patch
- v39-0006-README.XID64.patch
- v39-0003-Use-64-bit-FullTransactionId-instead-of-Epoch-xi.patch
- v39-0008-Use-64-bit-XIDs.patch
Attachment
- v40-0005-Add-initdb-option-to-initialize-cluster-with-non.patch
- v40-0001-Use-64-bit-numbering-of-SLRU-pages.patch
- v40-0004-Use-64-bit-pages-representation-in-SLRU-callers.patch
- v40-0003-Use-64-bit-FullTransactionId-instead-of-Epoch-xi.patch
- v40-0002-Use-64-bit-format-to-output-XIDs.patch
- v40-0006-README.XID64.patch
- v40-0007-Use-64-bit-GUCs.patch
- v40-0008-Use-64-bit-XIDs.patch
Attachment
- v41-0001-Use-64-bit-numbering-of-SLRU-pages.patch
- v41-0005-Add-initdb-option-to-initialize-cluster-with-non.patch
- v41-0003-Use-64-bit-FullTransactionId-instead-of-Epoch-xi.patch
- v41-0002-Use-64-bit-format-to-output-XIDs.patch
- v41-0004-Use-64-bit-pages-representation-in-SLRU-callers.patch
- v41-0006-README.XID64.patch
- v41-0007-Use-64-bit-GUCs.patch
- v41-0008-Use-64-bit-XIDs.patch
Attachment
- v42-0001-Use-64-bit-numbering-of-SLRU-pages.patch
- v42-0005-Add-initdb-option-to-initialize-cluster-with-non.patch
- v42-0002-Use-64-bit-format-to-output-XIDs.patch
- v42-0003-Use-64-bit-FullTransactionId-instead-of-Epoch-xi.patch
- v42-0004-Use-64-bit-pages-representation-in-SLRU-callers.patch
- v42-0006-README.XID64.patch
- v42-0007-Use-64-bit-GUCs.patch
- v42-0008-Use-64-bit-XIDs.patch
Pavel Borisov
Hi, hackers!

v42 stopped applying, so we rebased it to v43. Attached is a GitHub link, but I see Cfbot hasn't become green. Apparently, it hasn't seen changes in the GitHub link relative to v42 attached as a patch.
Pavel Borisov
On Fri, Jul 15, 2022 at 03:23:29PM +0400, Pavel Borisov wrote:
> On Fri, 15 Jul 2022 at 15:20, Pavel Borisov <pashkin.elfe@gmail.com> wrote:
> > Hi, hackers!
> > v42 stopped applying, so we rebased it to v43. Attached is a GitHub link,
> > but I see Cfbot hasn't become green. Apparently, it hasn't seen changes in
> > GitHub link relative to v42 attached as a patch.
>
> Github link is as follows:
> https://github.com/ziva777/postgres/tree/64xid-cf
> Maybe this will enable CFbot to see it.

My suggestion was a bit ambiguous, sorry for the confusion.

The "Git link" is just an annotation - cfbot doesn't try to do anything with it[0] (but cirrusci will run the same checks under your github account).

What I meant was that it doesn't seem important to send rebases multiple times per week if there's been no other changes. Anyone is still able to review the patch. It's possible to build it by applying the patch to a checkout of the postgres tree at the time the patch was sent. Or by using the git link. Or by checking out cfbot's branch here, from the last time it *did* apply.
https://github.com/postgresql-cfbot/postgresql/tree/commitfest/38/3594

[0] (I think cfbot *could* try to read the git link, and apply patches that it finds there, but that's an idea that hasn't been proposed or discussed. It'd need to know which branch to use, and it'd need to know when to use the git link and when to use the most-recent email attachments).

--
Justin
On Fri, Jul 15, 2022 at 03:23:29PM +0400, Pavel Borisov wrote:
> On Fri, 15 Jul 2022 at 15:20, Pavel Borisov <pashkin.elfe@gmail.com> wrote:
> > Hi, hackers!
> > v42 stopped applying, so we rebased it to v43. Attached is a GitHub link,
> > but I see Cfbot hasn't become green. Apparently, it hasn't seen changes in
> > GitHub link relative to v42 attached as a patch.
>
> Github link is as follows:
> https://github.com/ziva777/postgres/tree/64xid-cf
> Maybe this will enable CFbot to see it.
My suggestion was a bit ambiguous, sorry for the confusion.
The "Git link" is just an annotation - cfbot doesn't try to do anything with
it[0] (but cirrusci will run the same checks under your github account).
What I meant was that it doesn't seem imporant to send rebases multiple times
per week if there's been no other changes. Anyone is still able to review the
patch. It's possible to build it by applying the patch to a checkout of the
postgres tree at the time the patch was sent. Or by using the git link. Or by
checking out cfbot's branch here, from the last time it *did* apply.
https://github.com/postgresql-cfbot/postgresql/tree/commitfest/38/3594
[0] (I think cfbot *could* try to read the git link, and apply patches that it
finds there, but that's an idea that hasn't been proposed or discussed. It'd
need to know which branch to use, and it'd need to know when to use the git
link and when to use the most-recent email attachments).
Pavel Borisov
Hi hackers,

> I can agree with you that sending rebased patches too often can be a little annoying. On the other hand, otherwise, it's just red in Cfbot. I suppose it's much easier and more comfortable to review the patches that at least apply cleanly and pass all tests. So if Cfbot is red for a long time I feel we need to send a rebased patchset anyway.
>
> I'll try to not doing this too often but frankly, I don't see a better alternative at the moment.

Considering the overall activity on the mailing list, personally I don't see a problem here. Several extra emails don't bother me at all, but I would like to see a green cfbot report for an open item in the CF application. Otherwise someone will complain that the patch doesn't apply anymore and the result will be the same as for sending an updated patch, except that we will receive at least two emails instead of one.

--
Best regards,
Aleksander Alekseev
> I can agree with you that sending rebased patches too often can be a little annoying. On the other hand, otherwise, it's just red in Cfbot. I suppose it's much easier and more comfortable to review the patches that at least apply cleanly and pass all tests. So if Cfbot is red for a long time I feel we need to send a rebased patchset anyway.
>
> I'll try to not doing this too often but frankly, I don't see a better alternative at the moment.
Considering the overall activity on the mailing list personally I
don't see a problem here. Several extra emails don't bother me at all,
but I would like to see a green cfbot report for an open item in the
CF application. Otherwise someone will complain that the patch doesn't
apply anymore and the result will be the same as for sending an
updated patch, except that we will receive at least two emails instead
of one.
Pavel Borisov
Attachment
- v43-0001-Use-64-bit-numbering-of-SLRU-pages.patch
- v43-0005-Add-initdb-option-to-initialize-cluster-with-non.patch
- v43-0006-README.XID64.patch
- v43-0007-Use-64-bit-GUCs.patch
- v43-0004-Use-64-bit-pages-representation-in-SLRU-callers.patch
- v43-0008-Use-64-bit-XIDs.patch
- v43-0003-Use-64-bit-FullTransactionId-instead-of-Epoch-xi.patch
- v43-0002-Use-64-bit-format-to-output-XIDs.patch
On Wed, Jan 05, 2022 at 06:12:26PM -0600, Justin Pryzby wrote:
> On Wed, Jan 05, 2022 at 06:51:37PM -0500, Bruce Momjian wrote:
> > On Tue, Jan 4, 2022 at 10:22:50PM +0000, Finnerty, Jim wrote:
> > > I'm concerned about the maintainability impact of having 2 new
> > > on-disk page formats. It's already complex enough with XIDs and
> > > multixact-XIDs.
> > >
> > > If the lack of space for the two epochs in the special data area is
> > > a problem only in an upgrade scenario, why not resolve the problem
> > > before completing the upgrade process like a kind of post-process
> > > pg_repack operation that converts all "double xmax" pages to
> > > the "double-epoch" page format? i.e. maybe the "double xmax"
> > > representation is needed as an intermediate representation during
> > > upgrade, but after upgrade completes successfully there are no pages
> > > with the "double-xmax" representation. This would eliminate a whole
> > > class of coding errors and would make the code dealing with 64-bit
> > > XIDs simpler and more maintainable.
> >
> > Well, yes, we could do this, and it would avoid the complexity of having
> > to support two XID representations, but we would need to accept that
> > fast pg_upgrade would be impossible in such cases, since every page
> > would need to be checked and potentially updated.
> >
> > You might try to do this while the server is first started and running
> > queries, but I think we found out from the online checkpoint patch that
>
> I think you meant the online checksum patch.

Which this reminded me of, too. I wondered whether anyone had considered using relation forks to maintain state of these long, transitional processes. Either a whole new fork, or additional bits in the visibility map, which has page-level bits.

There'd still need to be a flag somewhere indicating whether checksums/xid64s/etc were enabled cluster-wide. The VM/fork bits would need to be checked while the cluster was being re-processed online. This would add some overhead. After the cluster had reached its target state, the flag could be set, and the VM bits would no longer need to be checked.

--
Justin
On Mon, Jul 18, 2022 at 2:54 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:
>>
>> > I can agree with you that sending rebased patches too often can be a little annoying. On the other hand, otherwise, it's just red in Cfbot. I suppose it's much easier and more comfortable to review the patches that at least apply cleanly and pass all tests. So if Cfbot is red for a long time I feel we need to send a rebased patchset anyway.
>> >
>> > I'll try to not doing this too often but frankly, I don't see a better alternative at the moment.
>>
>> Considering the overall activity on the mailing list personally I
>> don't see a problem here. Several extra emails don't bother me at all,
>> but I would like to see a green cfbot report for an open item in the
>> CF application. Otherwise someone will complain that the patch doesn't
>> apply anymore and the result will be the same as for sending an
>> updated patch, except that we will receive at least two emails instead
>> of one.
>
> Hi, Alexander!
> Agree with you. I also consider green cfbot entry important. So PFA rebased v43.

Since we have converted TransactionId to 64-bit, so do we still need the concept of FullTransactionId? I mean it is really confusing to have 3 forms of transaction ids, i.e. TransactionId, FullTransactionId and ShortTransactionId.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
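As background for the terminology, the three forms roughly line up as follows under the patchset; this is a hedged sketch based on the cover letter, not the exact declarations from the patch:

```c
#include <stdint.h>

typedef uint64_t TransactionId;        /* in-memory XIDs become 64-bit */
typedef uint32_t ShortTransactionId;   /* 32-bit on-disk xmin/xmax, stored
                                        * as an offset from the page base */

typedef struct FullTransactionId       /* pre-existing 64-bit wrapper type */
{
    uint64_t value;
} FullTransactionId;
```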
On Sun, Sep 4, 2022 at 9:53 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jul 18, 2022 at 2:54 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:
> >>
> >> > I can agree with you that sending rebased patches too often can be a little annoying. On the other hand, otherwise, it's just red in Cfbot. I suppose it's much easier and more comfortable to review the patches that at least apply cleanly and pass all tests. So if Cfbot is red for a long time I feel we need to send a rebased patchset anyway.
> >> >
> >> > I'll try to not doing this too often but frankly, I don't see a better alternative at the moment.
> >>
> >> Considering the overall activity on the mailing list personally I
> >> don't see a problem here. Several extra emails don't bother me at all,
> >> but I would like to see a green cfbot report for an open item in the
> >> CF application. Otherwise someone will complain that the patch doesn't
> >> apply anymore and the result will be the same as for sending an
> >> updated patch, except that we will receive at least two emails instead
> >> of one.
> >
> > Hi, Alexander!
> > Agree with you. I also consider green cfbot entry important. So PFA rebased v43.
>
> Since we have converted TransactionId to 64-bit, so do we still need
> the concept of FullTransactionId? I mean it is really confusing to
> have 3 forms of transaction ids. i.e. Transaction Id,
> FullTransactionId and ShortTransactionId.

I have done some more reviews to understand the idea. I think this patch needs far more comments to make it completely readable.

1.
typedef struct HeapTupleData
{
+	TransactionId t_xmin;	/* base value for normal transaction ids */
+	TransactionId t_xmax;	/* base value for mutlixact */

I think the field name and comments are not in sync: the field says xmin and xmax whereas the comment says base value for transaction id and multi-xact.

2.
extern bool heap_page_prepare_for_xid(Relation relation, Buffer buffer,
									  TransactionId xid, bool multi);

I noticed that this function is returning bool but all the callers are ignoring the return type.

3.
+static int
+heap_page_try_prepare_for_xid(Relation relation, Buffer buffer, Page page,
+							  TransactionId xid, bool multi, bool is_toast)
+{
+	TransactionId base;
+	ShortTransactionId min = InvalidTransactionId,

Add function header comments.

4.
+	if (!multi)
+	{
+		Assert(!is_toast || !(htup->t_infomask & HEAP_XMAX_IS_MULTI));
+
+		if (TransactionIdIsNormal(htup->t_choice.t_heap.t_xmin) &&
+			!HeapTupleHeaderXminFrozen(htup))
+		{
+			xid_min_max(min, max, htup->t_choice.t_heap.t_xmin, &found);
+		}
+
+		if (htup->t_infomask & HEAP_XMAX_INVALID)
+			continue;
+
+		if ((htup->t_infomask & HEAP_XMAX_IS_MULTI) &&
+			(!(htup->t_infomask & HEAP_XMAX_LOCK_ONLY)))
+		{
+			TransactionId update_xid;
+			ShortTransactionId xid;
+
+			Assert(!is_toast);
+			update_xid = MultiXactIdGetUpdateXid(HeapTupleHeaderGetRawXmax(page, htup),
+												 htup->t_infomask);
+			xid = NormalTransactionIdToShort(HeapPageGetSpecial(page)->pd_xid_base,
+											 update_xid);
+
+			xid_min_max(min, max, xid, &found);
+		}
+	}

Why no handling for multi? And this function has absolutely no comments to understand the reason for this.

5.
+	if (IsToastRelation(relation))
+	{
+		PageInit(page, BufferGetPageSize(buffer), sizeof(ToastPageSpecialData));
+		ToastPageGetSpecial(page)->pd_xid_base = RecentXmin - FirstNormalTransactionId;
+	}
+	else
+	{
+		PageInit(page, BufferGetPageSize(buffer), sizeof(HeapPageSpecialData));
+		HeapPageGetSpecial(page)->pd_xid_base = RecentXmin - FirstNormalTransactionId;
+	}

Why pd_xid_base can not be just RecentXmin? Please explain in the comments above.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
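One plausible reading of that initialization, offered here as an assumption rather than the authors' explanation: with pd_xid_base = RecentXmin - FirstNormalTransactionId, the oldest XID the page is expected to store (RecentXmin) encodes to the smallest *normal* 32-bit value, which keeps 0..2 free for the reserved special XIDs. A sketch:

```c
#include <stdint.h>

#define InvalidTransactionId     UINT64_C(0)
#define BootstrapTransactionId   UINT64_C(1)
#define FrozenTransactionId      UINT64_C(2)
#define FirstNormalTransactionId UINT64_C(3)

/* Sketch: 64-bit XID -> 32-bit on-disk value relative to the page base. */
static inline uint32_t
xid_to_short(uint64_t pd_xid_base, uint64_t xid)
{
    return (uint32_t) (xid - pd_xid_base);
}

/*
 * With base = RecentXmin - FirstNormalTransactionId:
 *   xid_to_short(base, RecentXmin) == FirstNormalTransactionId (3),
 * so no normal XID ever maps onto the reserved short values 0..2.
 */
```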
Since we have converted TransactionId to 64-bit, so do we still need
the concept of FullTransactionId? I mean it is really confusing to
have 3 forms of transaction ids. i.e. Transaction Id,
FullTransactionId and ShortTransactionId.
1.
typedef struct HeapTupleData
{
+ TransactionId t_xmin; /* base value for normal transaction ids */
+ TransactionId t_xmax; /* base value for mutlixact */
I think the field name and comments are not in sync, field says xmin
and xmax whereas the comment says base value for
transaction id and multi-xact.
2.
extern bool heap_page_prepare_for_xid(Relation relation, Buffer buffer,
TransactionId xid, bool multi);
I noticed that this function is returning bool but all the callers are
ignoring the return type.
3.
+static int
+heap_page_try_prepare_for_xid(Relation relation, Buffer buffer, Page page,
+ TransactionId xid, bool multi, bool is_toast)
+{
+ TransactionId base;
+ ShortTransactionId min = InvalidTransactionId,
add function header comments.
4.
Why no handling for multi? And this function has absolutely no
comments to understand the reason for this.
5.
Why pd_xid_base can not be just RecentXmin? Please explain in the
comments above.
--
Attachment
- v46-0005-Add-initdb-option-to-initialize-cluster-with-non.patch
- v46-0003-Use-64-bit-FullTransactionId-instead-of-Epoch-xi.patch
- v46-0004-Use-64-bit-pages-representation-in-SLRU-callers.patch
- v46-0002-Use-64-bit-format-to-output-XIDs.patch
- v46-0001-Use-64-bit-numbering-of-SLRU-pages.patch
- v46-0006-README.XID64.patch
- v46-0007-Use-64-bit-GUCs.patch
- v46-0008-Use-64-bit-XIDs.patch
--
Is this patch targeting PG16 now?
I want to have a look at these patches, but apply on master failed:
Applying: Use 64-bit numbering of SLRU pages.
Applying: Use 64-bit format to output XIDs
Applying: Use 64-bit FullTransactionId instead of Epoch:xid
Applying: Use 64-bit pages representation in SLRU callers.
error: patch failed: src/backend/access/transam/multixact.c:1228
error: src/backend/access/transam/multixact.c: patch does not apply
Patch failed at 0004 Use 64-bit pages representation in SLRU callers.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
With these patches, it seems that we don’t need to handle wraparound in
GetNextLocalTransactionId() too, as LocalTransactionId is uint64 now.
```
LocalTransactionId
GetNextLocalTransactionId(void)
{
LocalTransactionId result;
/* loop to avoid returning InvalidLocalTransactionId at wraparound */
do
{
result = nextLocalTransactionId++;
} while (!LocalTransactionIdIsValid(result));
return result;
}
```
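If a 64-bit LocalTransactionId really cannot wrap in practice, the retry loop could presumably be dropped along the lines of the sketch below; this is only an illustration of the point above, not a change proposed by the patch authors, and it assumes nextLocalTransactionId is initialized to a valid (non-zero) value.

```c
LocalTransactionId
GetNextLocalTransactionId(void)
{
	/*
	 * Sketch: with a 64-bit counter, wrapping around to
	 * InvalidLocalTransactionId is unreachable in practice,
	 * so no retry loop is needed.
	 */
	return nextLocalTransactionId++;
}
```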
On Tue, Sep 20, 2022 at 03:37:47PM +0800, Zhang Mingli wrote:
> I want to have a look at these patches, but apply on master failed:

Yeah, it's likely to break every week or more often.

You have a few options:

0) resolve the conflict yourself;

1) apply the patch to the commit that the authors sent it against, or some commit before the conflicting file(s) were changed in master. Like maybe "git checkout -b 64bitxids f66d997fd".

2) Use the last patch that cfbot successfully created. You can read the patch on github's web interface, or add cfbot's user as a remote to use the patch locally for review and/or compilation. Something like "git remote add cfbot https://github.com/postgresql-cfbot/postgresql; git fetch cfbot commitfest/39/3594; git checkout -b 64bitxids cfbot/commitfest/39/3594". (Unfortunately, cfbot currently squishes the patch series into a single commit and loses the commit message).

You could also check the git link in the commitfest, to see if the author has already rebased it, but haven't yet mailed the rebased patch to the list. In this case, that's not true, but you could probably use the author's branch on github, too.
https://commitfest.postgresql.org/39/3594/

--
Justin
On Tue, Sep 20, 2022 at 03:37:47PM +0800, Zhang Mingli wrote:I want to have a look at these patches, but apply on master failed:
Yeah, it's likely to break every week or more often.
You have a few options:
0) resolve the conflict yourself;
1) apply the patch to the commit that the authors sent it against, or
some commit before the conflicting file(s) were changed in master. Like
maybe "git checkout -b 64bitxids f66d997fd".
2) Use the last patch that cfbot successfully created. You can read the
patch on github's web interface, or add cfbot's user as a remote to use
the patch locally for review and/or compilation. Something like "git
remote add cfbot https://github.com/postgresql-cfbot/postgresql; git
fetch cfbot commitfest/39/3594; git checkout -b 64bitxids
cfbot/commitfest/39/3594". (Unfortunately, cfbot currently squishes the
patch series into a single commit and loses the commit message).
You could also check the git link in the commitfest, to see if the
author has already rebased it, but haven't yet mailed the rebased patch
to the list. In this case, that's not true, but you could probably use
the author's branch on github, too.
https://commitfest.postgresql.org/39/3594/
--
Justin
Regards,
Zhang Mingli
Here is a rebased version of the patch set.
Major changes are:
1. Fix rare replica fault.
Upon page pruning in heap_page_prune, page fragmentation repair is determined by
a parameter repairFragmentation. At the same time, on a replica, upon handling XLOG_HEAP2_PRUNE record type
in heap_xlog_prune, we always call heap_page_prune_execute with repairFragmentation parameter equal to true.
This caused page inconsistency and led to a crash of the replica. Fix this by adding a new flag in
struct xl_heap_prune.
2. Add support for meson build.
3. Add assertion "buffer is locked" in HeapTupleCopyBaseFromPage.
4. Add assertion "buffer is locked exclusive" in heap_page_shift_base.
5. Prevent excessive growth of xmax in heap_prepare_freeze_tuple.
As always, reviews are very welcome!
--
Attachment
- v47-0004-Use-64-bit-pages-representation-in-SLRU-callers.patch
- v47-0002-Use-64-bit-format-to-output-XIDs.patch
- v47-0001-Use-64-bit-numbering-of-SLRU-pages.patch
- v47-0003-Use-64-bit-FullTransactionId-instead-of-Epoch-xi.patch
- v47-0005-Add-initdb-option-to-initialize-cluster-with-non.patch
- v47-0007-Use-64-bit-GUCs.patch
- v47-0006-README.XID64.patch
- v47-0008-Use-64-bit-XIDs.patch
On Fri, Oct 07, 2022 at 02:04:09PM +0300, Maxim Orlov wrote:
> As always, reviews are very welcome!

This patch set needs a rebase, as far as I can see.

--
Michael
Attachment
This patch set needs a rebase, as far as I can see.
Attachment
- v48-0004-Use-64-bit-pages-representation-in-SLRU-callers.patch
- v48-0001-Use-64-bit-numbering-of-SLRU-pages.patch
- v48-0003-Use-64-bit-FullTransactionId-instead-of-Epoch-xi.patch
- v48-0005-Add-initdb-option-to-initialize-cluster-with-non.patch
- v48-0002-Use-64-bit-format-to-output-XIDs.patch
- v48-0006-README.XID64.patch
- v48-0007-Use-64-bit-GUCs.patch
- v48-0008-Use-64-bit-XIDs.patch
On Oct 22, 2022, 00:09 +0800, Maxim Orlov <orlovmg@gmail.com>, wrote:
> Done! Thanks! Here is the rebased version.
> This version has bug fix for multixact replication. Previous versions of the patch set does not write pd_multi_base in WAL. Thus, this field was set to 0 upon WAL reply on replica.
> This caused replica to panic. Fix this by adding pd_multi_base of a page into WAL. Appropriate tap test is added.
> Also, add refactoring and improvements in heapam.c in order to reduce diff and make it more "tidy".
> Reviews and opinions are very welcome!
> --
> Best regards,
> Maxim Orlov.
These variables are not used any more.
I attach an additional V48-0009 patch as they are just comments, apply it if you want to.
Attachment
I attach an additional V48-0009 patch as they are just comments, apply it if you want to.
Attachment
- v49-0004-Use-64-bit-pages-representation-in-SLRU-callers.patch
- v49-0001-Use-64-bit-numbering-of-SLRU-pages.patch
- v49-0005-Add-initdb-option-to-initialize-cluster-with-non.patch
- v49-0002-Use-64-bit-format-to-output-XIDs.patch
- v49-0003-Use-64-bit-FullTransactionId-instead-of-Epoch-xi.patch
- v49-0006-README.XID64.patch
- v49-0007-Use-64-bit-GUCs.patch
- v49-0008-Use-64-bit-XIDs.patch
On Thu, 3 Nov 2022 at 08:12, Maxim Orlov <orlovmg@gmail.com> wrote:
>
> Hi!
>
>> I attach an additional V48-0009 patch as they are just comments, apply it if you want to.
>
> Big thank you for your review. I've applied your addition in the recent patch set below.
>
> Besides, mentioned above, next changes are made:
> - rename HeapTupleCopyBaseFromPage to HeapTupleCopyXidsFromPage, since this old name came from the time when the "t_xid_base" was stored in tuple,
> and not correspond to recent state of the code;
> - replace ToastTupleHeader* calls with HeapHeader* with the "is_toast" argument. This reduces diff and make the code more readable;
> - put HeapTupleSetZeroXids calls in several places for the sake of redundancy;
> - in heap_tuple_would_freeze add case to reset xmax without reading clog;
> - rename SeqTupleHeaderSetXmax/Xmin to SeqTupleSetXmax/min and refactoring of the function; Now it will set HeapTuple and HeapTupleHeader xmax;
> - add case of int64 values in check_GUC_init;
> - massive refactoring in htup_details.h to use inline functions with type control over macro;
> - reorder code in htup_details.h to reduce overall diff.
>
> As always, reviews and opinions are very welcome!

0008 needs a rebase. heapam.h and catversion.h are failing.

Regards

Thom
0008 needs a rebase. heapam.h and catversion.h are failing.
Regards
Thom
Attachment
- v50-0004-Use-64-bit-pages-representation-in-SLRU-callers.patch
- v50-0001-Use-64-bit-numbering-of-SLRU-pages.patch
- v50-0005-Add-initdb-option-to-initialize-cluster-with-non.patch
- v50-0003-Use-64-bit-FullTransactionId-instead-of-Epoch-xi.patch
- v50-0002-Use-64-bit-format-to-output-XIDs.patch
- v50-0007-Use-64-bit-GUCs.patch
- v50-0006-README.XID64.patch
- v50-0008-Use-64-bit-XIDs.patch
Hi hackers,

> Thanks, done!

Dilip Kumar asked a good question in the thread about the 0001..0003 subset [1]. I would like to duplicate it here to make sure it was not missed by mistake:

"""
Have we measured the WAL overhead because of this patch set? maybe these particular patches will not impact but IIUC this is ground work for making xid 64 bit. So each XLOG record size will increase at least by 4 bytes because the XLogRecord contains the xid.
"""

Do we have an estimate on this?

[1]: https://www.postgresql.org/message-id/CAFiTN-uudj2PY8GsUzFtLYFpBoq_rKegW3On_8ZHdxB1mVv3-A%40mail.gmail.com

--
Best regards,
Aleksander Alekseev
Hi hackers,

> Dilip Kumar asked a good question in the thread about the 0001..0003
> subset [1]. I would like to duplicate it here to make sure it was not
> missed by mistake:
>
> """
> Have we measured the WAL overhead because of this patch set? maybe
> these particular patches will not impact but IIUC this is ground work
> for making xid 64 bit. So each XLOG record size will increase at
> least by 4 bytes because the XLogRecord contains the xid.
> """
>
> Do we have an estimate on this?

I decided to simulate one completely synthetic and non-representative scenario when we write multiple small WAL records. Here is what I did:

$ psql -c 'CREATE TABLE phonebook( "id" SERIAL PRIMARY KEY NOT NULL, "name" TEXT NOT NULL, "phone" INT NOT NULL);'
$ echo 'INSERT INTO phonebook (name, phone) VALUES (random(), random());' > t.sql
$ pgbench -j 8 -c 8 -f t.sql -T 60 eax

== 32-bit XIDs ==

Branch: https://github.com/afiskon/postgres/tree/64bit_xids_v50_14Nov_without_patch

pgbench output:

```
number of transactions actually processed: 68650
number of failed transactions: 0 (0.000%)
latency average = 6.993 ms
initial connection time = 5.415 ms
tps = 1144.074340 (without initial connection time)
```

$ ls -lah /home/eax/projects/pginstall/data-master/pg_wal
...
-rw------- 1 eax eax 16M Nov 16 12:48 000000010000000000000002
-rw------- 1 eax eax 16M Nov 16 12:47 000000010000000000000003

$ pg_waldump -p ~/projects/pginstall/data-master/pg_wal 000000010000000000000002 000000010000000000000003 | perl -e 'while(<>) { $_ =~ m#len \(rec/tot\):\s*(\d+)/\s*(\d+),#; $rec += $1; $tot += $2; $count++; } $rec /= $count; $tot /= $count; print "rec: $rec, tot: $tot\n";'
pg_waldump: error: error in WAL record at 0/28A4118: invalid record length at 0/28A4190: wanted 24, got 0
rec: 65.8201835569952, tot: 67.3479022057689

== 64-bit XIDs ==

Branch: https://github.com/afiskon/postgres/tree/64bit_xids_v50_14Nov

pgbench output:

```
number of transactions actually processed: 68744
number of failed transactions: 0 (0.000%)
latency average = 6.983 ms
initial connection time = 5.334 ms
tps = 1145.664765 (without initial connection time)
```

$ ls -lah /home/eax/projects/pginstall/data-master/pg_wal
...
-rw------- 1 eax eax 16M Nov 16 12:32 000000010000000000000002
-rw------- 1 eax eax 16M Nov 16 12:31 000000010000000000000003

$ pg_waldump -p ~/projects/pginstall/data-master/pg_wal 000000010000000000000002 000000010000000000000003 | perl -e 'while(<>) { $_ =~ m#len \(rec/tot\):\s*(\d+)/\s*(\d+),#; $rec += $1; $tot += $2; $count++; } $rec /= $count; $tot /= $count; print "rec: $rec, tot: $tot\n";'
pg_waldump: error: error in WAL record at 0/29F4778: invalid record length at 0/29F4810: wanted 26, got 0
rec: 69.1783950928736, tot: 70.6413934278527

So under this load with 64-bit XIDs we see a ~5% penalty in terms of the WAL size. This seems to be expected considering the fact that sizeof(XLogRecord) became 4 bytes larger and the average record size was about 66 bytes.

--
Best regards,
Aleksander Alekseev
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:       not tested
Spec compliant:           not tested
Documentation:            not tested

I have a very serious concern about the current patch set, as someone who has faced transaction id wraparound in the past.

I can start by saying I think it would be helpful (if the other issues are approached reasonably) to have 64-bit xids, but there is an important piece of context in preventing xid wraparounds that seems missing from this patch unless I missed something.

XID wraparound is a symptom, not an underlying problem. It usually occurs when autovacuum or other vacuum strategies have unexpected stalls and therefore fail to work as expected. Shifting to 64-bit XIDs dramatically changes the sorts of problems that these stalls are likely to pose to operational teams -- you can find you are running out of storage rather than facing an imminent database shutdown. Worse, this patch delays the problem until some (possibly far later!) time, when vacuum will take far longer to finish, and options for resolving the problem are diminished. As a result I am concerned that merely changing xids from 32-bit to 64-bit will lead to a smaller number of far more serious outages.

What would make a big difference from my perspective would be to combine this with an inverse system for warning that there is a problem, allowing the administrator to throw warnings about xids since last vacuum, with a configurable threshold. We could have this at two billion by default as that would pose operational warnings not much later than we have now.

Otherwise I can imagine cases where instead of 30 hours to vacuum a table, it takes 300 hours on a database that is short on space. And I would not want to be facing such a situation.

The new status of this patch is: Waiting on Author
> I have a very serious concern about the current patch set, as someone who has faced transaction id wraparound in the past.
>
> I can start by saying I think it would be helpful (if the other issues are approached reasonably) to have 64-bit xids, but there is an important piece of context in preventing xid wraparounds that seems missing from this patch unless I missed something.
>
> XID wraparound is a symptom, not an underlying problem. It usually occurs when autovacuum or other vacuum strategies have unexpected stalls and therefore fail to work as expected. Shifting to 64-bit XIDs dramatically changes the sorts of problems that these stalls are likely to pose to operational teams -- you can find you are running out of storage rather than facing an imminent database shutdown. Worse, this patch delays the problem until some (possibly far later!) time, when vacuum will take far longer to finish, and options for resolving the problem are diminished. As a result I am concerned that merely changing xids from 32-bit to 64-bit will lead to a smaller number of far more serious outages.
>
> What would make a big difference from my perspective would be to combine this with an inverse system for warning that there is a problem, allowing the administrator to throw warnings about xids since last vacuum, with a configurable threshold. We could have this at two billion by default as that would pose operational warnings not much later than we have now.
>
> Otherwise I can imagine cases where instead of 30 hours to vacuum a table, it takes 300 hours on a database that is short on space. And I would not want to be facing such a situation.

Hi, Chris!

I had a similar stance when I started working on this patch. Of course, it seemed horrible just to postpone the consequences of inadequate monitoring, too long running transactions that prevent aggressive autovacuum etc. So I can understand your point.

With time I've got to a little bit of another view of this feature i.e.

1. It's important to correctly set monitoring, the cut-off of long transactions, etc. anyway. It's not the responsibility of vacuum before wraparound to report inadequate monitoring etc. Furthermore, in real life, this will be already too late if it prevents 32-bit wraparound and invokes much downtime in an unexpected moment of time if it occurs already. (The rough analogy for that is the machine running at 120mph turns every control off and applies full brakes just because the cooling liquid is low (of course there might be a warning previously, but anyway))

2. The checks and handlers for the event that is never expected in the cluster lifetime (~200 years at constant rate of 1e6 TPS) can be just dropped. Of course we still need to do automatic routine maintenance like cutting SLRU buffers (but with a much bigger interval if we have much disk space e.g.). But I considered that we either can not care what will be with cluster after > 200 years (it will be migrated many times before this, on many reasons not related to Postgres even for the most conservative owners). So the radical proposal is to drop 64-bit wraparound at all. The most moderate one is just not taking very much care that after 200 years we have more hassle than next month if we haven't set up everything correctly. Next month's pain will be more significant even if it teaches dba something.

Big thanks for your view on the general implementation of this feature, anyway.

Kind regards,
Pavel Borisov.
Supabase
Question about """ Subject: [PATCH v50 5/8] Add initdb option to initialize cluster with non-standard xid/mxid/mxoff. To date testing database cluster wraparund was not easy as initdb has always inited it with default xid/mxid/mxoff. The option to specify any valid xid/mxid/mxoff at cluster startup will make these things easier. """ Doesn't pg_resetwal already provide that functionality, or at least some of it?
Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes: >> To date testing database cluster wraparund was not easy as initdb has always >> inited it with default xid/mxid/mxoff. The option to specify any valid >> xid/mxid/mxoff at cluster startup will make these things easier. > Doesn't pg_resetwal already provide that functionality, or at least some > of it? pg_resetwal does seem like a better, more useful home for this; it'd allow you to adjust these numbers after initial creation which might be useful. I'm not sure how flexible it is right now in terms of where you can set the new values, but that can always be improved. regards, tom lane
Hi, On 2022-11-21 14:21:35 -0500, Tom Lane wrote: > Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes: > >> To date testing database cluster wraparund was not easy as initdb has always > >> inited it with default xid/mxid/mxoff. The option to specify any valid > >> xid/mxid/mxoff at cluster startup will make these things easier. > > > Doesn't pg_resetwal already provide that functionality, or at least some > > of it? > > pg_resetwal does seem like a better, more useful home for this; it'd > allow you to adjust these numbers after initial creation which might be > useful. I'm not sure how flexible it is right now in terms of where > you can set the new values, but that can always be improved. IIRC the respective pg_resetwal parameters are really hard to use for something like this, because they don't actually create the respective SLRU segments. We of course could fix that. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2022-11-21 14:21:35 -0500, Tom Lane wrote: >> pg_resetwal does seem like a better, more useful home for this; it'd >> allow you to adjust these numbers after initial creation which might be >> useful. I'm not sure how flexible it is right now in terms of where >> you can set the new values, but that can always be improved. > IIRC the respective pg_resetwal parameters are really hard to use for > something like this, because they don't actually create the respective > SLRU segments. We of course could fix that. Is that still true? We should fix it, for sure. regards, tom lane
Hi, On 2022-11-21 15:16:46 -0500, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2022-11-21 14:21:35 -0500, Tom Lane wrote: > >> pg_resetwal does seem like a better, more useful home for this; it'd > >> allow you to adjust these numbers after initial creation which might be > >> useful. I'm not sure how flexible it is right now in terms of where > >> you can set the new values, but that can always be improved. > > > IIRC the respective pg_resetwal parameters are really hard to use for > > something like this, because they don't actually create the respective > > SLRU segments. We of course could fix that. > > Is that still true? We should fix it, for sure. Sure looks that way to me. I think it might mostly work if you manage to find nextXid, nextMulti, nextMultiOffset values that each point to the start of a segment that'd then be created whenever those values are used. Greetings, Andres Freund
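To illustrate the gap Andres describes, a rough sketch of what one would currently have to do by hand (untested; assumes the default 8 kB block size, i.e. 1,048,576 transactions per pg_xact segment, and the target XID 3000000000 is only an example):

```
$ pg_ctl -D $PGDATA stop
$ pg_resetwal -D $PGDATA -x 3000000000
# pg_resetwal sets the next XID but does not create the matching SLRU
# segment, so create a zeroed pg_xact segment for it by hand:
# segment number = 3000000000 / 1048576 = 2861 = 0x0B2D
$ dd if=/dev/zero of=$PGDATA/pg_xact/0B2D bs=8192 count=32
$ pg_ctl -D $PGDATA start
```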
Hi hackers,
> > I have a very serious concern about the current patch set, as someone who has faced transaction id wraparound in the past.
>
> [...]
>
> I had a similar stance when I started working on this patch. Of
> course, it seemed horrible just to postpone the consequences of
> inadequate monitoring, too long running transactions that prevent
> aggressive autovacuum etc. So I can understand your point.
>
> With time I've got to a little bit of another view of this feature i.e.
>
> 1. It's important to correctly set monitoring, the cut-off of long
> transactions, etc. anyway. It's not the responsibility of vacuum
> before wraparound to report inadequate monitoring etc. Furthermore, in
> real life, this will be already too late if it prevents 32-bit
> wraparound and invokes much downtime in an unexpected moment of time
> if it occurs already. (The rough analogy for that is the machine
> running at 120mph turns every control off and applies full brakes just
> because the cooling liquid is low (of course there might be a warning
> previously, but anyway))
>
> 2. The checks and handlers for the event that is never expected in the
> cluster lifetime (~200 years at constant rate of 1e6 TPS) can be just
> dropped. Of course we still need to do automatic routine maintenance
> like cutting SLRU buffers (but with a much bigger interval if we have
> much disk space e.g.). But I considered that we either can not care
> what will be with cluster after > 200 years (it will be migrated many
> times before this, on many reasons not related to Postgres even for
> the most conservative owners). So the radical proposal is to drop
> 64-bit wraparound at all. The most moderate one is just not taking
> very much care that after 200 years we have more hassle than next
> month if we haven't set up everything correctly. Next month's pain
> will be more significant even if it teaches dba something.
>
> Big thanks for your view on the general implementation of this feature, anyway.
I'm inclined to agree with Pavel on this one. Keeping 32-bit XIDs in
order to intentionally trigger XID wraparound to indicate running out of
disk space and/or a misconfigured system (by the time when it's usually
too late anyway) is a somewhat arguable perspective. It would be great
to notify the user about the potential issues with the configuration
and/or the fact that VACUUM doesn't catch up. But it doesn't mean we
should keep 32-bit XIDs in order to achieve this.
--
Best regards,
Aleksander Alekseev
> I have a very serious concern about the current patch set, as someone who has faced transaction id wraparound in the past.
>
> I can start by saying I think it would be helpful (if the other issues are approached reasonably) to have 64-bit xids, but there is an important piece of context in preventing xid wraparounds that seems missing from this patch unless I missed something.
>
> XID wraparound is a symptom, not an underlying problem. It usually occurs when autovacuum or other vacuum strategies have unexpected stalls and therefore fail to work as expected. Shifting to 64-bit XIDs dramatically changes the sorts of problems that these stalls are likely to pose to operational teams. -- you can find you are running out of storage rather than facing an imminent database shutdown. Worse, this patch delays the problem until some (possibly far later!) time, when vacuum will take far longer to finish, and options for resolving the problem are diminished. As a result I am concerned that merely changing xids from 32-bit to 64-bit will lead to a smaller number of far more serious outages.
>
> What would make a big difference from my perspective would be to combine this with an inverse system for warning that there is a problem, allowing the administrator to throw warnings about xids since last vacuum, with a configurable threshold. We could have this at two billion by default as that would pose operational warnings not much later than we have now.
>
> Otherwise I can imagine cases where instead of 30 hours to vacuum a table, it takes 300 hours on a database that is short on space. And I would not want to be facing such a situation.
> Hi, Chris!
> I had a similar stance when I started working on this patch. Of
> course, it seemed horrible just to postpone the consequences of
> inadequate monitoring, too long running transactions that prevent
> aggressive autovacuum etc. So I can understand your point.
> With time I've got to a little bit of another view of this feature i.e.
> 1. It's important to correctly set monitoring, the cut-off of long
> transactions, etc. anyway. It's not the responsibility of vacuum
> before wraparound to report inadequate monitoring etc. Furthermore, in
> real life, this will be already too late if it prevents 32-bit
> wraparound and invokes much downtime in an unexpected moment of time
> if it occurs already. (The rough analogy for that is the machine
> running at 120mph turns every control off and applies full brakes just
> because the cooling liquid is low (of course there might be a warning
> previously, but anyway))
Right now the way things work is:
1. Database starts throwing warnings that xid wraparound is approaching
2. Database-owning team initiates an emergency response, may take downtime or degradation of services as a result
3. People get frustrated with PostgreSQL because this is a reliability problem.
What I am worried about is:
1. Database is running out of space
2. Database-owning team initiates an emergency response and takes more downtime to get into a good spot
3. People get frustrated with PostgreSQL because this is a reliability problem.
If that's the way we go, I don't think we've solved that much. And as humans we also bias our judgments towards newsworthy events, so rarer, more severe problems are a larger perceived problem than the more routine, less severe problems. So I think our image as a reliable database would suffer.
An ideal resolution from my perspective would be:
1. Database starts throwing warnings that xid lag has reached severely abnormal levels
2. Database owning team initiates an effort to correct this, and does not take downtime or degradation of services as a result
3. People do not get frustrated because this is not a reliability problem anymore.
Now, 64-bit xids are necessary to get us there but they are not sufficient. One needs to fix the way we handle this sort of problem. There is existing logic to warn if we are approaching xid wraparound. This should be changed to check how many xids we have used rather than remaining and have a sensible default there (optionally configurable).
I agree it is not vacuum's responsibility. It is the responsibility of the current warnings we have to avoid more serious problems arising from this change. These should just be adjusted rather than dropped.
> 2. The checks and handlers for the event that is never expected in the
> cluster lifetime (~200 years at constant rate of 1e6 TPS) can be just
> dropped. Of course we still need to do automatic routine maintenance
> like cutting SLRU buffers (but with a much bigger interval if we have
> much disk space e.g.). But I considered that we either can not care
> what will be with cluster after > 200 years (it will be migrated many
> times before this, on many reasons not related to Postgres even for
> the most conservative owners). So the radical proposal is to drop
> 64-bit wraparound at all. The most moderate one is just not taking
> very much care that after 200 years we have more hassle than next
> month if we haven't set up everything correctly. Next month's pain
> will be more significant even if it teaches dba something.
> Big thanks for your view on the general implementation of this feature, anyway.
> Kind regards,
> Pavel Borisov.
> Supabase
Hi Chris,

> Right now the way things work is:
> 1. Database starts throwing warnings that xid wraparound is approaching
> 2. Database-owning team initiates an emergency response, may take downtime or degradation of services as a result
> 3. People get frustrated with PostgreSQL because this is a reliability problem.
>
> What I am worried about is:
> 1. Database is running out of space
> 2. Database-owning team initiates an emergency response and takes more downtime to get into a good spot
> 3. People get frustrated with PostgreSQL because this is a reliability problem.
>
> If that's the way we go, I don't think we've solved that much. And as humans we also bias our judgments towards newsworthy events, so rarer, more severe problems are a larger perceived problem than the more routine, less severe problems. So I think our image as a reliable database would suffer.
>
> An ideal resolution from my perspective would be:
> 1. Database starts throwing warnings that xid lag has reached severely abnormal levels
> 2. Database owning team initiates an effort to correct this, and does not take downtime or degradation of services as a result
> 3. People do not get frustrated because this is not a reliability problem anymore.
>
> Now, 64-bit xids are necessary to get us there but they are not sufficient. One needs to fix the way we handle this sort of problem. There is existing logic to warn if we are approaching xid wraparound. This should be changed to check how many xids we have used rather than remaining and have a sensible default there (optionally configurable).
>
> I agree it is not vacuum's responsibility. It is the responsibility of the current warnings we have to avoid more serious problems arising from this change. These should just be adjusted rather than dropped.

I disagree with the axiom that XID wraparound is merely a symptom and not a problem.

Using 32-bit XIDs was a reasonable design decision back when disk space was limited and disks were slow. The drawback of this approach is the need to do the wraparound but again back then it was a reasonable design choice. If XIDs were 64-bit from the beginning users could run one billion (1,000,000,000) TPS for 584 years without a wraparound. We wouldn't have it, similarly as there is no wraparound for WAL segments. Now when disks are much faster and much cheaper 32-bit XIDs are almost certainly not a good design choice anymore. (Especially considering the fact that this particular patch mitigates the problem of increased disk consumption greatly.)

Also I disagree with an argument that a DBA that doesn't monitor disk space would care much about some strange warnings in the logs. If a DBA doesn't monitor basic system metrics I'm afraid we can't help this person much.

I do agree that we could probably provide some additional help for the rest of the users when it comes to configuring VACUUM. This is indeed non-trivial. However I don't think this is in scope of this particular patchset. I suggest we keep the focus in this discussion. If you have a concrete proposal please consider starting a new thread.

This at least is my personal opinion. Let's give the rest of the community a chance to share their thoughts.

--
Best regards,
Aleksander Alekseev
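The 584-year figure above is simple arithmetic; expressed as a query purely for convenience:

```
SELECT (2::numeric ^ 64) / (1000000000::numeric * 86400 * 365.25) AS years_at_1e9_tps;
-- ≈ 584.5 years of 64-bit XID space at a sustained 10^9 TPS
```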
Hi, Alexander! On Tue, 22 Nov 2022 at 13:01, Aleksander Alekseev <aleksander@timescale.com> wrote: > > Hi Chris, > > > Right now the way things work is: > > 1. Database starts throwing warnings that xid wraparound is approaching > > 2. Database-owning team initiates an emergency response, may take downtime or degradation of services as a result > > 3. People get frustrated with PostgreSQL because this is a reliability problem. > > > > What I am worried about is: > > 1. Database is running out of space > > 2. Database-owning team initiates an emergency response and takes more downtime to into a good spot > > 3. People get frustrated with PostgreSQL because this is a reliability problem. > > > > If that's the way we go, I don't think we've solved that much. And as humans we also bias our judgments towards newsworthyevents, so rarer, more severe problems are a larger perceived problem than the more routine, less severe problems. So I think our image as a reliable database would suffer. > > > > An ideal resolution from my perspective would be: > > 1. Database starts throwing warnings that xid lag has reached severely abnormal levels > > 2. Database owning team initiates an effort to correct this, and does not take downtime or degradation of services asa result > > 3. People do not get frustrated because this is not a reliability problem anymore. > > > > Now, 64-big xids are necessary to get us there but they are not sufficient. One needs to fix the way we handle thissort of problem. There is existing logic to warn if we are approaching xid wraparound. This should be changed to checkhow many xids we have used rather than remaining and have a sensible default there (optionally configurable). > > > > I agree it is not vacuum's responsibility. It is the responsibility of the current warnings we have to avoid more seriousproblems arising from this change. These should just be adjusted rather than dropped. > > I disagree with the axiom that XID wraparound is merely a symptom and > not a problem. > > Using 32-bit XIDs was a reasonable design decision back when disk > space was limited and disks were slow. The drawback of this approach > is the need to do the wraparound but agaig back then it was a > reasonable design choice. If XIDs were 64-bit from the beginning users > could run one billion (1,000,000,000) TPS for 584 years without a > wraparound. We wouldn't have it similarly as there is no wraparound > for WAL segments. Now when disks are much faster and much cheaper > 32-bit XIDs are almost certainly not a good design choice anymore. > (Especially considering the fact that this particular patch mitigates > the problem of increased disk consumption greatly.) > > Also I disagree with an argument that a DBA that doesn't monitor disk > space would care much about some strange warnings in the logs. If a > DBA doesn't monitor basic system metrics I'm afraid we can't help this > person much. > > I do agree that we could probably provide some additional help for the > rest of the users when it comes to configuring VACUUM. This is indeed > non-trivial. However I don't think this is in scope of this particular > patchset. I suggest we keep the focus in this discussion. If you have > a concrete proposal please consider starting a new thread. > > This at least is my personal opinion. Let's give the rest of the > community a chance to share their thoughts. 
I agree with Alexander that notifications for the DBA are a little bit outside the scope of the activity in this thread, unless we've just dropped some existing notifications, considering they're not significant anymore. If that was the point, please, Chris, mention which existing notifications you want to return. I don't think it's a big deal to have the patch with certain notifications inherited from the master branch.

Kind regards,
Pavel Borisov
Supabase.
Hi hackers, [ Excluding my personal e-mail from cc:, not sure how it got there. Please don't cc: to afiskon@gmail.com, I'm not using it for reading pgsql-hackers@. ] > I agree with Alexander, that notifications for DBA are a little bit > outside the scope of the activity in this thread unless we've just > dropped some existing notifications, considering they're not > significant anymore. If that was the point, please Chris mention what > existing notifications you want to return. I don't think it's a big > deal to have the patch with certain notifications inherited from > Master branch. To clarify a bit: currently we DO notify the user about the upcoming wraparound point [1]: """ If for some reason autovacuum fails to clear old XIDs from a table, the system will begin to emit warning messages like this when the database's oldest XIDs reach forty million transactions from the wraparound point: WARNING: database "mydb" must be vacuumed within 39985967 transactions HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database. """ So I'm not sure how the notification Chris proposes should differ or why it is in scope of this patch. If the point was to make sure certain existing notifications will be preserved - sure, why not. [1]: https://www.postgresql.org/docs/current/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND -- Best regards, Aleksander Alekseev
On Tue, Nov 22, 2022 at 7:44 PM Aleksander Alekseev <aleksander@timescale.com> wrote: > > Hi hackers, > > [ Excluding my personal e-mail from cc:, not sure how it got there. > Please don't cc: to afiskon@gmail.com, I'm not using it for reading > pgsql-hackers@. ] > > > I agree with Alexander, that notifications for DBA are a little bit > > outside the scope of the activity in this thread unless we've just > > dropped some existing notifications, considering they're not > > significant anymore. If that was the point, please Chris mention what > > existing notifications you want to return. I don't think it's a big > > deal to have the patch with certain notifications inherited from > > Master branch. > > To clarify a bit: currently we DO notify the user about the upcoming > wraparound point [1]: > > """ > If for some reason autovacuum fails to clear old XIDs from a table, > the system will begin to emit warning messages like this when the > database's oldest XIDs reach forty million transactions from the > wraparound point: > > WARNING: database "mydb" must be vacuumed within 39985967 transactions > HINT: To avoid a database shutdown, execute a database-wide VACUUM in > that database. > """ > > So I'm not sure how the notification Chris proposes should differ or > why it is in scope of this patch. If the point was to make sure > certain existing notifications will be preserved - sure, why not. IMHO, after having 64-bit XID this WARNING doesn't really make sense. Those warnings exist because those limits were problematic for 32-bit xid but now it is not so I think we should not have such warnings. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
> Hi Chris,
> Right now the way things work is:
> 1. Database starts throwing warnings that xid wraparound is approaching
> 2. Database-owning team initiates an emergency response, may take downtime or degradation of services as a result
> 3. People get frustrated with PostgreSQL because this is a reliability problem.
>
> What I am worried about is:
> 1. Database is running out of space
> 2. Database-owning team initiates an emergency response and takes more downtime to get into a good spot
> 3. People get frustrated with PostgreSQL because this is a reliability problem.
>
> If that's the way we go, I don't think we've solved that much. And as humans we also bias our judgments towards newsworthy events, so rarer, more severe problems are a larger perceived problem than the more routine, less severe problems. So I think our image as a reliable database would suffer.
>
> An ideal resolution from my perspective would be:
> 1. Database starts throwing warnings that xid lag has reached severely abnormal levels
> 2. Database owning team initiates an effort to correct this, and does not take downtime or degradation of services as a result
> 3. People do not get frustrated because this is not a reliability problem anymore.
>
> Now, 64-bit xids are necessary to get us there but they are not sufficient. One needs to fix the way we handle this sort of problem. There is existing logic to warn if we are approaching xid wraparound. This should be changed to check how many xids we have used rather than remaining and have a sensible default there (optionally configurable).
>
> I agree it is not vacuum's responsibility. It is the responsibility of the current warnings we have to avoid more serious problems arising from this change. These should just be adjusted rather than dropped.
> I disagree with the axiom that XID wraparound is merely a symptom and
> not a problem.
XID wraparound doesn't happen to healthy databases. It happens when:
1. Autovacuum is stalled, and
2. Monitoring is not checking for xid lag (which would be fixed by autovacuum if it were running properly).
XID wraparound is downstream of those problems. At least that is my experience. If you disagree, I would like to hear why.
Additionally those problems still will cause worse outages with this change unless there are some mitigating measures in place. If you don't like my proposal, I would be open to other mitigating measures. But I think there need to be mitigating measures in a change like this.
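For what it's worth, that kind of xid lag can already be watched from plain SQL today; a minimal sketch (the LIMIT is arbitrary) would be:

```
-- per-database XID age; alert long before it drifts toward the billions
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;

-- the tables that are furthest behind on freezing
SELECT relname, age(relfrozenxid) AS xid_age
FROM pg_class
WHERE relkind IN ('r', 'm', 't')
ORDER BY xid_age DESC
LIMIT 10;
```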
> Using 32-bit XIDs was a reasonable design decision back when disk
> space was limited and disks were slow. The drawback of this approach
> is the need to do the wraparound but again back then it was a
> reasonable design choice. If XIDs were 64-bit from the beginning users
> could run one billion (1,000,000,000) TPS for 584 years without a
> wraparound. We wouldn't have it, similarly as there is no wraparound
> for WAL segments. Now when disks are much faster and much cheaper
> 32-bit XIDs are almost certainly not a good design choice anymore.
> (Especially considering the fact that this particular patch mitigates
> the problem of increased disk consumption greatly.)
> Also I disagree with an argument that a DBA that doesn't monitor disk
> space would care much about some strange warnings in the logs. If a
> DBA doesn't monitor basic system metrics I'm afraid we can't help this
> person much.
Now that's fine if you want to run a bloatless table engine but to my knowledge none of these are production-ready yet. ZHeap seems mostly stalled. Oriole is still experimental. But with the current PostgreSQL table structure.
A DBA can monitor disk space, but if the DBA is not also monitoring xid lag, then by the time corrective action is taken it may be too late.
> I do agree that we could probably provide some additional help for the
> rest of the users when it comes to configuring VACUUM. This is indeed
> non-trivial. However I don't think this is in scope of this particular
> patchset. I suggest we keep the focus in this discussion. If you have
> a concrete proposal please consider starting a new thread.
> This at least is my personal opinion. Let's give the rest of the
> community a chance to share their thoughts.
> --
> Best regards,
> Aleksander Alekseev
Hi Chris,

> XID wraparound doesn't happen to healthy databases
> If you disagree, I would like to hear why.

Consider the case when you run a slow OLAP query that takes 12h to complete and 100K TPS of fast OLTP-type queries on the same system. The fast queries will consume all 32-bit XIDs in less than 12 hours, while the OLAP query started 12 hours ago didn't finish yet and thus its tuples can't be frozen.

> I agree that 64-bit xids are a good idea. I just don't think that existing safety measures should be ignored or reverted.

Fair enough.

> The problem isn't just the lack of disk space, but the difficulty that stuck autovacuum runs pose in resolving the issue. Keep in mind that everything you can do to reclaim disk space (vacuum full, cluster, pg_repack) will be significantly slowed down by an extremely bloated table/index comparison. The problem is that if you are running out of disk space, and your time to recovery is much longer than expected, then you have a major problem. It's not just one or the other, but the combination that poses the real risk here.
>
> Now that's fine if you want to run a bloatless table engine but to my knowledge none of these are production-ready yet. ZHeap seems mostly stalled. Oriole is still experimental. But with the current PostgreSQL table structure.
>
> A DBA can monitor disk space, but if the DBA is not also monitoring xid lag, then by the time corrective action is taken it may be too late.

Good point but I still don't think this is related to the XID wraparound problem.

--
Best regards,
Aleksander Alekseev
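The arithmetic behind the OLAP/OLTP example above, taking roughly 2^31 XIDs as the usable range and ignoring the handful of reserved ones:

```
SELECT (2::numeric ^ 31) / 100000 / 3600 AS hours_to_burn_32bit_xids_at_100k_tps;
-- ≈ 6 hours, i.e. well before the 12-hour OLAP query finishes
```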
On Fri, 21 Oct 2022 at 17:09, Maxim Orlov <orlovmg@gmail.com> wrote: > Reviews and opinions are very welcome! I'm wondering whether the safest way to handle this is by creating a new TAM called "heap64", so that all storage changes happens there. (Obviously there are still many other changes in core, but they are more easily fixed). That would reduce the code churn around "heap", allowing us to keep it stable while we move to the brave new world. Many current users see stability as one of the greatest strengths of Postgres, so while I very much support this move, I wonder if this gives us a way to have both stability and innovation at the same time? -- Simon Riggs http://www.EnterpriseDB.com/
On Sun, Nov 20, 2022 at 11:58 PM Chris Travers <chris.travers@gmail.com> wrote: > I can start by saying I think it would be helpful (if the other issues are approached reasonably) to have 64-bit xids,but there is an important piece of context in reventing xid wraparounds that seems missing from this patch unless Imissed something. > > XID wraparound is a symptom, not an underlying problem. It usually occurs when autovacuum or other vacuum strategies haveunexpected stalls and therefore fail to work as expected. Shifting to 64-bit XIDs dramatically changes the sorts ofproblems that these stalls are likely to pose to operational teams. -- you can find you are running out of storage ratherthan facing an imminent database shutdown. Worse, this patch delays the problem until some (possibly far later!) time,when vacuum will take far longer to finish, and options for resolving the problem are diminished. As a result I amconcerned that merely changing xids from 32-bit to 64-bit will lead to a smaller number of far more serious outages. This is exactly what I think (except perhaps for the part about having fewer outages overall). The more transaction ID space you need, the more space you're likely to need in the near future. We can all agree that having more runway is better than having less runway, at least in some abstract sense, but that in itself doesn't help the patch series very much. The first time the system-level oldestXid (or database level datminfrozenxid) attains an age of 2 billion XIDs will usually *also* be the first time it attains an age of (say) 300 million XIDs. Even 300 million is usually a huge amount of XID space relative to (say) the number of XIDs used every 24 hours. So I know exactly what you mean about just addressing a symptom. The whole project seems to just ignore basic, pertinent questions. Questions like: why are we falling behind like this in the first place? And: If we don't catch up soon, why should we be able to catch up later on? Falling behind on freezing is still a huge problem with 64-bit XIDs. Part of the problem with the current design is that table age has approximately zero relationship with the true cost of catching up on freezing -- we are "using the wrong units", in a very real sense. In general we may have to do zero freezing to advance a table's relfrozenxid age by a billion XIDs, or we might have to write terabytes of FREEZE_PAGE records to advance a similar looking table's relfrozenxid by just one single XID (it could also swing wildly over time for the same table). Which the system simply doesn't try to reason about right now. There are no settings for freezing that use physical units, and frame the problem as a problem of being behind by this many unfrozen pages (they are all based on XID age). And so the problem with letting table age get into the billions isn't even that we'll never catch up -- we actually might catch up very easily! The real problem is that we have no way of knowing ahead of time (at least not right now). VACUUM should be managing the debt, *and* the uncertainty about how much debt we're really in. VACUUM needs to have a dynamic, probabilistic understanding of what's really going on -- something much more sophisticated than looking at table age in autovacuum.c. 
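For illustration, the closest thing available today to that kind of physical-units view is the visibility/freeze map; a rough sketch using the pg_visibility extension (relpages is only an estimate, so treat the result as approximate):

```
CREATE EXTENSION IF NOT EXISTS pg_visibility;

-- approximate freeze debt in pages: heap pages not yet marked all-frozen
SELECT c.relname,
       c.relpages,
       v.all_frozen,
       c.relpages - v.all_frozen AS approx_unfrozen_pages
FROM pg_class c,
     LATERAL pg_visibility_map_summary(c.oid::regclass) AS v
WHERE c.relkind = 'r' AND c.relpages > 0
ORDER BY approx_unfrozen_pages DESC
LIMIT 10;
```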
One reason why you might want to advance relfrozenxid proactively is to give the system a better general sense of the true relationship between logical XID space and physical freezing for a given table and workload -- it gives a clearer picture about the conditions in the table. The relationship between SLRU space and physical heap pages and the work of freezing is made somewhat clearer by a more proactive approach to advancing relfrozenxid. That's one way that VACUUM can lower the uncertainty I referred to. -- Peter Geoghegan
Hi hackers, > I'm wondering whether the safest way to handle this is by creating a > new TAM called "heap64", so that all storage changes happens there. > Many current users see stability as one of the greatest strengths of > Postgres, so while I very much support this move, I wonder if this > gives us a way to have both stability and innovation at the same time? That would be nice. However from what I see TransactionId is a type used globally in PostgresSQL. It is part of structures used by TAM interface, used in WAL records, etc. So we will have to learn these components to work with 64-bit XIDs anyway and then start thinking about cases like: when a user runs a transaction affecting two tables, a heap32 one and heap64 one and we will have to figure out which tuples are visible and which are not. This perhaps is doable but the maintenance burden for the project will be too high IMO. It seems to me that the best option we can offer for the users looking for stability is to use the latest PostgreSQL version with 32-bit XIDs. Assuming these users care that much about this particular design choice of course. > The whole project seems to just ignore basic, pertinent questions. > Questions like: why are we falling behind like this in the first > place? And: If we don't catch up soon, why should we be able to catch > up later on? Falling behind on freezing is still a huge problem with > 64-bit XIDs. Is the example I provided above wrong? """ Consider the case when you run a slow OLAP query that takes 12h to complete and 100K TPS of fast OLTP-type queries on the same system. The fast queries will consume all 32-bit XIDs in less than 12 hours, while the OLAP query started 12 hours ago didn't finish yet and thus its tuples can't be frozen. """ If it is, please let me know. I would very much like to know if my understanding here is flawed. -- Best regards, Aleksander Alekseev
Trying to discuss where we are talking past each other...
Hi hackers,
> I'm wondering whether the safest way to handle this is by creating a
> new TAM called "heap64", so that all storage changes happens there.
> Many current users see stability as one of the greatest strengths of
> Postgres, so while I very much support this move, I wonder if this
> gives us a way to have both stability and innovation at the same time?
> That would be nice.
> However from what I see TransactionId is a type used globally in
> PostgreSQL. It is part of structures used by TAM interface, used in
> WAL records, etc. So we will have to learn these components to work
> with 64-bit XIDs anyway and then start thinking about cases like: when
> a user runs a transaction affecting two tables, a heap32 one and
> heap64 one and we will have to figure out which tuples are visible and
> which are not. This perhaps is doable but the maintenance burden for
> the project will be too high IMO.
> It seems to me that the best option we can offer for the users looking
> for stability is to use the latest PostgreSQL version with 32-bit
> XIDs. Assuming these users care that much about this particular design
> choice of course.
Also related to that, I think you would have to have a check on streaming replication that both instances use the same xid format (that you don't accidentally upgrade this somehow), since this is set per db cluster, right?
> The whole project seems to just ignore basic, pertinent questions.
> Questions like: why are we falling behind like this in the first
> place? And: If we don't catch up soon, why should we be able to catch
> up later on? Falling behind on freezing is still a huge problem with
> 64-bit XIDs.
> Is the example I provided above wrong?
> """
> Consider the case when you run a slow OLAP query that takes 12h to
> complete and 100K TPS of fast OLTP-type queries on the same system.
> The fast queries will consume all 32-bit XIDs in less than 12 hours,
> while the OLAP query started 12 hours ago didn't finish yet and thus
> its tuples can't be frozen.
> """
> If it is, please let me know. I would very much like to know if my
> understanding here is flawed.
This being said, there is another set of xid wraparound cases which today is much larger in number and that I think would be hurt if this patchset were to be accepted into Postgres without mitigating measures which you consider out of bounds -- cases like Mailchimp, Adjust, and the like. This is why I keep stressing this, and I don't think waving away concerns about use cases outside of the one you are focusing on is helpful, particularly from those of us who have faced xid wraparounds in these cases in the past. In these cases, database teams are usually faced with an operational emergency while tools like vacuum, pg_repack, etc are severely degraded due to getting so far behind on freezing. The deeper the hole, the harder it will be to dig out of.
> --
> Best regards,
> Aleksander Alekseev
--
Attachment
- v51-0005-Add-initdb-option-to-initialize-cluster-with-non.patch
- v51-0004-Use-64-bit-pages-representation-in-SLRU-callers.patch
- v51-0003-Use-64-bit-FullTransactionId-instead-of-Epoch-xi.patch
- v51-0001-Use-64-bit-numbering-of-SLRU-pages.patch
- v51-0002-Use-64-bit-format-to-output-XIDs.patch
- v51-0006-README.XID64.patch
- v51-0007-Use-64-bit-GUCs.patch
- v51-0008-Use-64-bit-XIDs.patch
On Sat, Nov 26, 2022 at 4:08 AM Chris Travers <chris@orioledata.com> wrote: > I didn't see any changes to pg_upgrade to make this change possible on upgrade. Is that also outside of the scope of yourpatch set? If so how is that continuity supposed to be ensured? The scheme is documented in their 0006 patch, in a README.XID file. I'm not entirely confident that it's the best design and have argued against it in the past, but it's not crazy. More generally, while I think there's plenty of stuff to be concerned about in this patch set and while I'm somewhat skeptical about the likelihood of its getting or staying committed, I can't really understand your concerns in particular. The thrust of your concern seems to be that if we allow people to get further behind, recovery will be more difficult. I'm not sure I see the problem. Suppose that we adopt this proposal and that it is bug-free. Now, consider a user who gets 8 billion XIDs behind. They probably have to vacuum pretty much every page in the database to do that, or least every page in the tables that haven't been vacuumed recently. But that would likely also be true if they were 800 million XIDs behind, as is possible today. The effort to catch up doesn't increase linearly with how far behind you are, and is always bounded by the DB size. It is true that if the table is progressively bloating, it is likely to be more bloated by the time you are 8 billion XIDs behind than it was when you were 800 million XIDs behind. I don't see that as a very good reason not to adopt this patch, because you can bloat the table by an arbitrarily large amount while consuming only a small number of XiDs, even just 1 XID. Protecting against bloat is good, but shutting down the database when the XID age reaches a certain value is not a particularly effective way of doing that, so saying that we'll be hurting people by not shutting down the database at the point where we do so today doesn't ring true to me. I think that most people who get to the point of wraparound shutdown have workloads where bloat isn't a huge issue, because those who do start having problems with the bloat way before they run out of XIDs. It would be entirely possible to add a parameter to the system that says "hey, you know we can keep running even if we're a shazillion XiDs behind, but instead shut down when we are behind by this number of XIDs." Then, if somebody wants to force an automatic shutdown at that point, they could, and I think that then the scenario you're worried about just can't happen any more . But isn't that a little bit silly? You could also just monitor how far behind you are and page the DBA when you get behind by more than a certain number of XIDs. Then, you wouldn't be risking a shutdown, and you'd still be able to stay on top of the XID ages of your tables. Philosophically, I disagree with the idea of shutting down the database completely in any situation in which a reasonable alternative exists. Losing read and write availability is really bad, and I don't think it's what users want. I think that most users want the database to degrade gracefully when things do not go according to plan. Ideally, they'd like everything to Just Work, but reasonable users understand that sometimes there are going to be problems, and in my experience, what makes them happy is when the database acts to contain the scope of the problem so that it affects their workload as little as possible, rather than acting to magnify the problem so that it impacts their workload as much as possible. 
This patch, implementation and design concerns to one side, does that. I don't believe there's a single right answer to the question of what to do about vacuum falling behind, and I think it's worth exploring multiple avenues to improve the situation. You can have vacuum never run on a table at all, say because all of the workers are busy elsewhere, or because the table is locked until the heat death of the universe. You can have vacuum run on a table but too slowly to do any good, because of the vacuum cost delay mechanism. You can have vacuum run and finish but do little good because of prepared transactions or replication slots or long-running queries. It's reasonable to think about what kinds of steps might help in those different scenarios, and especially to think about what kind of steps might help in multiple cases. We should do that. But, I don't think any of that means that we can ignore the need for some kind of expansion of the XID space forever. Computers are getting faster. It's already possible to burn through the XID space in hours, and the number of hours is going to go down over time and maybe eventually the right unit will be minutes, or even seconds. Sometime before then, we need to do something to make the runway bigger, or else just give up on PostgreSQL being a relevant piece of software. Perhaps the thing we need to do is not exactly this, but if not, it's probably a sibling or cousin of this. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Nov 28, 2022 at 8:53 AM Robert Haas <robertmhaas@gmail.com> wrote: > It is true that if the table is progressively bloating, it is likely > to be more bloated by the time you are 8 billion XIDs behind than it > was when you were 800 million XIDs behind. I don't see that as a very > good reason not to adopt this patch, because you can bloat the table > by an arbitrarily large amount while consuming only a small number of > XiDs, even just 1 XID. Protecting against bloat is good, but shutting > down the database when the XID age reaches a certain value is not a > particularly effective way of doing that, so saying that we'll be > hurting people by not shutting down the database at the point where we > do so today doesn't ring true to me. I can't speak for Chris, but I think that almost everybody will agree on this much, without really having to think about it. It's easy to see that having more XID space is, in general, strictly a good thing. If there was a low risk way of getting that benefit, then I'd be in favor of it. Here's the problem that I see with this patch: I don't think that the risks are commensurate with the benefits. I can imagine being in favor of an even more invasive patch that (say) totally removes the concept of freezing, but that would have to be a very different sort of design. > Philosophically, I disagree with the idea of shutting down the > database completely in any situation in which a reasonable alternative > exists. Losing read and write availability is really bad, and I don't > think it's what users want. At a certain point it may make more sense to activate XidStopLimit protections (which will only prevent new XID allocations) instead of getting further behind on freezing, even in a world where we're never strictly obligated to activate XidStopLimit. It may in fact be the lesser evil, even with 64-bit XIDs -- because we still have to freeze, and the factors that drive when and how we freeze mostly aren't changed. Fundamentally, when we're falling behind on freezing, at a certain point we can expect to keep falling behind -- unless some kind of major shift happens. That's just how freezing works, with or without 64-bit XIDs/MXIDs. If VACUUM isn't keeping up with the allocation of transactions, then the system is probably misconfigured in some way. We should do our best to signal this as early and as frequently as possible, and we should mitigate specific hazards (e.g. old temp tables) if at all possible. We should activate the failsafe when things really start to look dicey (which, incidentally, the patch just removes). These mitigations may be very effective, but in the final analysis they don't address the fundamental weakness in freezing. Granted, the specifics of the current XidStopLimit mechanism are unlikely to directly carry over to 64-bit XIDs. XidStopLimit is structured in a way that doesn't actually consider freeze debt in units like unfrozen pages. Like Chris, I just don't see why the patch obviates the need for something like XidStopLimit, since the patch doesn't remove freezing. An improved XidStopLimit mechanism might even end up kicking in *before* the oldest relfrozenxid reached 2 billion XIDs, depending on the specifics. Removing the failsafe mechanism seems misguided to me for similar reasons. I recently learned that Amazon RDS has set a *lower* vacuum_failsafe_age default than the standard default (its default of 1.6 billion to only 1.2 billion on RDS). This decision predates my joining AWS. 
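For reference, the settings in question and their stock defaults can be checked directly (RDS reportedly ships a lower vacuum_failsafe_age, as noted above):

```
SELECT name, setting, boot_val
FROM pg_settings
WHERE name IN ('vacuum_failsafe_age',
               'vacuum_multixact_failsafe_age',
               'autovacuum_freeze_max_age');
-- vacuum_failsafe_age defaults to 1600000000, autovacuum_freeze_max_age to 200000000
```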
It seems as if practical experience has shown that allowing any table's age(relfrozenxid) to get too far past a billion is not a good idea. At least it's not a good idea on modern Postgres versions, that have the freeze map. We really shouldn't have to rely on having billions of XIDs available in the first place -- XID space isn't really a fungible commodity. It's much more important to recognize that something (some specific thing) just isn't working as designed, which in general could be pretty far removed from freezing. For example, index corruption could do it (at least without the failsafe). Some kind of autovacuum starvation could do it. It's almost always more complicated than "not enough available XID space". -- Peter Geoghegan
On Mon, Nov 28, 2022 at 4:09 PM Peter Geoghegan <pg@bowt.ie> wrote: > Granted, the specifics of the current XidStopLimit mechanism are > unlikely to directly carry over to 64-bit XIDs. XidStopLimit is > structured in a way that doesn't actually consider freeze debt in > units like unfrozen pages. Like Chris, I just don't see why the patch > obviates the need for something like XidStopLimit, since the patch > doesn't remove freezing. An improved XidStopLimit mechanism might even > end up kicking in *before* the oldest relfrozenxid reached 2 billion > XIDs, depending on the specifics. What is the purpose of using 64-bit XIDs, if not to avoid having to stop the world when we run short of XIDs? I'd say that if this patch, or any patch with broadly similar goals, fails to remove xidStopLimit, it might as well not exist. xidStopLimit is not a general defense against falling too far behind on freezing, and in general, there is no reason to think that we need such a defense. xidStopLimit is a defense against reusing the same XID and thus causing irreversible database corruption. When that possibility no longer exists, it has outlived its usefulness and we should be absolutely delighted to bid it adieu. It seems like you and Chris are proposing the moral equivalent of paying off your home loan but still sending a check to the mortgage company every month just to be sure they don't get mad. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Nov 28, 2022 at 04:30:22PM -0500, Robert Haas wrote: > What is the purpose of using 64-bit XIDs, if not to avoid having to > stop the world when we run short of XIDs? > > I'd say that if this patch, or any patch with broadly similar goals, > fails to remove xidStopLimit, it might as well not exist. > > xidStopLimit is not a general defense against falling too far behind > on freezing, and in general, there is no reason to think that we need > such a defense. xidStopLimit is a defense against reusing the same XID > and thus causing irreversible database corruption. When that > possibility no longer exists, it has outlived its usefulness and we > should be absolutely delighted to bid it adieu. > > It seems like you and Chris are proposing the moral equivalent of > paying off your home loan but still sending a check to the mortgage > company every month just to be sure they don't get mad. I think the problem is that we still have bloat with 64-bit XIDs, specifically pg_xact and pg_multixact files. Yes, that bloat is less serious, but it is still an issue worth reporting in the server logs, though not serious enough to stop the server from write queries. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Embrace your flaws. They make you human, rather than perfect, which you will never be.
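To put rough numbers on the pg_xact point: clog stores two bits per transaction (four transactions per byte), so the retained files grow linearly with however much XID history has to be kept around:

```
-- clog needed for the full 32-bit XID range vs. 100 billion retained XIDs
SELECT pg_size_pretty((2::numeric ^ 32) / 4)       AS clog_for_2pow32_xids,  -- ~1 GB
       pg_size_pretty((100 * 10::numeric ^ 9) / 4) AS clog_for_1e11_xids;    -- ~23 GB
```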
On Mon, Nov 28, 2022 at 1:30 PM Robert Haas <robertmhaas@gmail.com> wrote: > What is the purpose of using 64-bit XIDs, if not to avoid having to > stop the world when we run short of XIDs? I agree that the xidStopLimit mechanism was designed with the specific goal of preventing "true" physical XID wraparound that results in wrong answers to queries. It was not written with the intention of limiting the accumulation of freeze debt, which would have to use units like unfrozen heap pages to make any sense. That is a historic fact -- no question. I think that it is nevertheless quite possible that just refusing to allocate any more XIDs could make sense with 64-bit XIDs, where we don't strictly have to to keep the system fully functional. That might still be the lesser evil, in that other world. The cutoff itself would depend on many workload details, I suppose. Imagine if we actually had 64-bit XIDs -- let's assume for a moment that it's a done deal. This raises a somewhat awkward question: do you just let the system get further and further behind on freezing, forever? We can all agree that 2 billion XIDs is very likely the wrong time to start refusing new XIDs -- because it just isn't designed with debt in mind. But what's the right time, if any? How much debt is too much? At the very least these seem like questions that deserve serious consideration. > I'd say that if this patch, or any patch with broadly similar goals, > fails to remove xidStopLimit, it might as well not exist. Well, it could in principle be driven by lots of different kinds of information, and make better decisions by actually targeting freeze debt in some way. An "enhanced version of xidStopLimit with 64-bit XIDs" could kick in far far later than it currently would. Obviously that has some value. I'm not claiming to know how to build this "better xidStopLimit mechanism", by the way. I'm not seriously proposing it. Mostly I'm just saying that the question "where do you draw the line if not at 2 billion XIDs?" is a very pertinent question. It is not a question that is made any less valid by the fact that we already know that 2 billion XIDs is pretty far from optimal in almost all cases. Having some limit based on something is likely to be more effective than having no limit based on nothing. Admittedly this argument works a lot better with the failsafe than it does with xidStopLimit. Both are removed by the patch. -- Peter Geoghegan
On Mon, Nov 28, 2022 at 1:52 PM Bruce Momjian <bruce@momjian.us> wrote: > I think the problem is that we still have bloat with 64-bit XIDs, > specifically pg_xact and pg_multixact files. Yes, that bloat is less > serious, but it is still an issue worth reporting in the server logs, > though not serious enough to stop the server from accepting write queries. That's definitely a big part of it. Again, I don't believe that the idea is fundamentally without merit. Just that it's not worth it, given that having more XID space is very much not something that I think fixes most of the problems. And given the real risk of serious bugs with something this invasive. I believe that it would be more useful to focus on just not getting into trouble in the first place, as well as on mitigating specific problems that lead to the system reaching xidStopLimit in practice. I don't think that there is any good reason to allow datfrozenxid to go past about a billion. When it does the interesting questions are questions about what went wrong, and how that specific failure can be mitigated in a fairly direct way. We've already used way too much "XID space runway", so why should using even more help? It might, I suppose, but it almost seems like a distraction to me, as somebody that wants to make things better for users in general. As long as the system continues to misbehave (in whatever way it happens to be misbehaving), why should any amount of XID space ever be enough? I think that we'll be able to get rid of freezing in a few years time. But as long as we have freezing, we have these problems. -- Peter Geoghegan
On Mon, Nov 28, 2022 at 1:52 PM Peter Geoghegan <pg@bowt.ie> wrote: > I'm not claiming to know how to build this "better xidStopLimit > mechanism", by the way. I'm not seriously proposing it. Mostly I'm > just saying that the question "where do you draw the line if not at 2 > billion XIDs?" is a very pertinent question. It is not a question that > is made any less valid by the fact that we already know that 2 billion > XIDs is pretty far from optimal in almost all cases. Having some limit > based on something is likely to be more effective than having no limit > based on nothing. > > Admittedly this argument works a lot better with the failsafe than it > does with xidStopLimit. Both are removed by the patch. Come to think of it, if you were going to do something like this it would probably work by throttling XID allocations, with a gradual ramp-up. It would rarely get to the point that the system refused to allocate XIDs completely. It's not fundamentally unreasonable to force the application to live within its means by throttling. Feedback that slows down the rate of writes is much more common in the LSM tree world, within systems like MyRocks [1], where the risk of the system being destabilized by debt is more pressing. As I said, I don't think that this is a particularly promising way of addressing problems with Postgres XID space exhaustion, since I believe that the underlying issue isn't usually a simple lack of "XID space slack capacity". But if you assume that I'm wrong here (if you assume that we very often don't have the ability to freeze lazily enough), then ISTM that throttling or feedback to stall new writes is a very reasonable option. In fact, it's practically mandatory. [1] https://docs.google.com/presentation/d/1WgP-SlKay5AnSoVDSvOIzmu7edMmtYhdywoa0oAR4JQ/edit#slide=id.g8839c9d71b_0_79 -- Peter Geoghegan
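To make the throttling idea concrete, here is a minimal sketch of a gradual ramp-up (the names, thresholds, and the linear ramp are illustrative assumptions only, not anything proposed in the patch set):

```
/* Sketch only: gradual XID-allocation throttling. Past a soft limit, each
 * new XID allocation sleeps a little, ramping up linearly until a hard
 * limit. All identifiers and values here are hypothetical. */
#include <stdint.h>

static uint64_t xid_throttle_soft_limit = 1000000000;	/* start throttling here */
static uint64_t xid_throttle_hard_limit = 4000000000;	/* max delay from here on */
static int	xid_throttle_max_delay_ms = 100;

static int
xid_allocation_delay_ms(uint64_t xid_lag)
{
	if (xid_lag <= xid_throttle_soft_limit)
		return 0;
	if (xid_lag >= xid_throttle_hard_limit)
		return xid_throttle_max_delay_ms;

	/* linear ramp between the soft and hard limits */
	return (int) ((xid_lag - xid_throttle_soft_limit) * xid_throttle_max_delay_ms /
				  (xid_throttle_hard_limit - xid_throttle_soft_limit));
}
```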
I suppose I must not have been clear in what I am suggesting we do and why. I will try to answer specific points below and then restate what I think the problem is, and what I think should be done about it.
On Sat, Nov 26, 2022 at 4:08 AM Chris Travers <chris@orioledata.com> wrote:
> I didn't see any changes to pg_upgrade to make this change possible on upgrade. Is that also outside of the scope of your patch set? If so how is that continuity supposed to be ensured?
The scheme is documented in their 0006 patch, in a README.XID file.
I'm not entirely confident that it's the best design and have argued
against it in the past, but it's not crazy.
More generally, while I think there's plenty of stuff to be concerned
about in this patch set and while I'm somewhat skeptical about the
likelihood of its getting or staying committed, I can't really
understand your concerns in particular. The thrust of your concern
seems to be that if we allow people to get further behind, recovery
will be more difficult. I'm not sure I see the problem. Suppose that
we adopt this proposal and that it is bug-free. Now, consider a user
who gets 8 billion XIDs behind. They probably have to vacuum pretty
much every page in the database to do that, or least every page in the
tables that haven't been vacuumed recently. But that would likely also
be true if they were 800 million XIDs behind, as is possible today.
The effort to catch up doesn't increase linearly with how far behind
you are, and is always bounded by the DB size.
It is true that if the table is progressively bloating, it is likely
to be more bloated by the time you are 8 billion XIDs behind than it
was when you were 800 million XIDs behind. I don't see that as a very
good reason not to adopt this patch, because you can bloat the table
by an arbitrarily large amount while consuming only a small number of
XiDs, even just 1 XID. Protecting against bloat is good, but shutting
down the database when the XID age reaches a certain value is not a
particularly effective way of doing that, so saying that we'll be
hurting people by not shutting down the database at the point where we
do so today doesn't ring true to me. I think that most people who get
to the point of wraparound shutdown have workloads where bloat isn't a
huge issue, because those who do start having problems with the bloat
way before they run out of XIDs.
It would be entirely possible to add a parameter to the system that
says "hey, you know we can keep running even if we're a shazillion
XiDs behind, but instead shut down when we are behind by this number
of XIDs." Then, if somebody wants to force an automatic shutdown at
that point, they could, and I think that then the scenario you're
worried about just can't happen any more . But isn't that a little bit
silly? You could also just monitor how far behind you are and page the
DBA when you get behind by more than a certain number of XIDs. Then,
you wouldn't be risking a shutdown, and you'd still be able to stay on
top of the XID ages of your tables.
Philosophically, I disagree with the idea of shutting down the
database completely in any situation in which a reasonable alternative
exists. Losing read and write availability is really bad, and I don't
think it's what users want. I think that most users want the database
to degrade gracefully when things do not go according to plan.
Ideally, they'd like everything to Just Work, but reasonable users
understand that sometimes there are going to be problems, and in my
experience, what makes them happy is when the database acts to contain
the scope of the problem so that it affects their workload as little
as possible, rather than acting to magnify the problem so that it
impacts their workload as much as possible. This patch, implementation
and design concerns to one side, does that.
I don't believe there's a single right answer to the question of what
to do about vacuum falling behind, and I think it's worth exploring
multiple avenues to improve the situation. You can have vacuum never
run on a table at all, say because all of the workers are busy
elsewhere, or because the table is locked until the heat death of the
universe. You can have vacuum run on a table but too slowly to do any
good, because of the vacuum cost delay mechanism. You can have vacuum
run and finish but do little good because of prepared transactions or
replication slots or long-running queries. It's reasonable to think
about what kinds of steps might help in those different scenarios, and
especially to think about what kind of steps might help in multiple
cases. We should do that. But, I don't think any of that means that we
can ignore the need for some kind of expansion of the XID space
forever. Computers are getting faster. It's already possible to burn
through the XID space in hours, and the number of hours is going to go
down over time and maybe eventually the right unit will be minutes, or
even seconds. Sometime before then, we need to do something to make
the runway bigger, or else just give up on PostgreSQL being a relevant
piece of software.
Perhaps the thing we need to do is not exactly this, but if not, it's
probably a sibling or cousin of this.
--
Robert Haas
EDB: http://www.enterprisedb.com
> On Mon, Nov 28, 2022 at 1:52 PM Bruce Momjian <bruce@momjian.us> wrote:
> > I think the problem is that we still have bloat with 64-bit XIDs,
> > specifically pg_xact and pg_multixact files. Yes, that bloat is less
> > serious, but it is still an issue worth reporting in the server logs,
> > though not serious enough to stop the server from accepting write queries.
> That's definitely a big part of it.
> Again, I don't believe that the idea is fundamentally without merit.
> Just that it's not worth it, given that having more XID space is very
> much not something that I think fixes most of the problems. And given
> the real risk of serious bugs with something this invasive.
> I believe that it would be more useful to focus on just not getting
> into trouble in the first place, as well as on mitigating specific
> problems that lead to the system reaching xidStopLimit in practice. I
> don't think that there is any good reason to allow datfrozenxid to go
> past about a billion. When it does the interesting questions are
> questions about what went wrong, and how that specific failure can be
> mitigated in a fairly direct way.
> We've already used way too much "XID space runway", so why should using
> even more help? It might, I suppose, but it almost seems like a
> distraction to me, as somebody that wants to make things better for
> users in general. As long as the system continues to misbehave (in
> whatever way it happens to be misbehaving), why should any amount of
> XID space ever be enough?
Of course, as I think we agree, the priorities should be (in order):
1. Avoid trouble
2. Recover from trouble early
3. Provide more and better options for recovery.
I think 64bit xids are a very good idea, but they really fit in this bottom tier. Not being up against mathematical limits to the software when things are going bad is certainly a good thing. But I am really worried about the attitude that this patch really avoids trouble because in many cases, I don't think it does and therefore I believe we need to make sure we are not reducing visibility of underlying problems.
> I think that we'll be able to get rid of freezing in a few years time.
> But as long as we have freezing, we have these problems.
> --
> Peter Geoghegan
On Tue, Nov 29, 2022 at 02:35:20PM +0100, Chris Travers wrote: > So I think the problem is that PostgreSQL is becoming more and more scalable, > hardware is becoming more capable, and certain use cases are continuing to > scale up. Over time, we tend to find ourselves approaching the end of the > runway at ever higher velocities. That's a problem that will get significantly > worse over time. > > Of course, as I think we agree, the priorities should be (in order): > 1. Avoid trouble > 2. Recover from trouble early > 3. Provide more and better options for recovery. Warning about trouble is another area we should focus on here. > I think 64bit xids are a very good idea, but they really fit in this bottom > tier. Not being up against mathematical limits to the software when things > are going bad is certainly a good thing. But I am really worried about the > attitude that this patch really avoids trouble because in many cases, I don't > think it does and therefore I believe we need to make sure we are not reducing > visibility of underlying problems. As far as I know, all our freeze values are focused on avoiding XID wraparound. If XID wraparound is no longer an issue, we might find that our freeze limits can be much higher than they are now. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Embrace your flaws. They make you human, rather than perfect, which you will never be.
On 11/29/22 09:46, Bruce Momjian wrote: > As far as I know, all our freeze values are focused on avoiding XID > wraparound. If XID wraparound is no longer an issue, we might find that > our freeze limits can be much higher than they are now. > I'd be careful in that direction, as the values together with maintenance_work_mem also keep a lid on excessive index cleanup rounds. Regards, Jan
On Mon, Nov 28, 2022 at 4:53 PM Peter Geoghegan <pg@bowt.ie> wrote: > Imagine if we actually had 64-bit XIDs -- let's assume for a moment > that it's a done deal. This raises a somewhat awkward question: do you > just let the system get further and further behind on freezing, > forever? We can all agree that 2 billion XIDs is very likely the wrong > time to start refusing new XIDs -- because it just isn't designed with > debt in mind. But what's the right time, if any? How much debt is too > much? I simply don't see a reason to ever stop the server entirely. I don't even agree with the idea of slowing down XID allocation, let alone refusing it completely. When the range of allocated XIDs becomes too large, several bad things happen. First, we become unable to allocate new XIDs without corrupting the database. Second, pg_clog and other SLRUs become uncomfortably large. There may be some other things too that I'm not thinking about. But these things are not all equally bad. If these were medical problems, being unable to allocate new XIDs without data corruption would be a heart attack, and SLRUs getting bigger on disk would be acne. You don't handle problems of such wildly differing severity in the same way. When someone is having a heart attack, an ambulance rushes them to the hospital, running red lights as necessary. When someone has acne, you don't take them to the same hospital in the same ambulance and drive it at a slower rate of speed. You do something else entirely, and it's something that is in every way much less dramatic. There's no such thing as an attack of acne that's so bad that it requires an ambulance ride, but even a mild heart attack should result in a fast trip to the ER. So here. The two problems are so qualitatively different that the responses should also be qualitatively different. > Admittedly this argument works a lot better with the failsafe than it > does with xidStopLimit. Both are removed by the patch. I don't think the failsafe stuff should be removed, but it should probably be modified in some way. Running out of XIDs is the only valid reason for stopping the world, at least IMO, but it is definitely NOT the only reason for vacuuming more aggressively. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Nov 29, 2022 at 8:03 AM Chris Travers <chris@orioledata.com> wrote: > To be clear, I never suggested shutting down the database. What I have suggested is that repurposing the current approaching-xid-wraparound warnings to start complaining loudly when a threshold is exceeded would be helpful. I think it makes sense to make that threshold configurable especially if we eventually have people running bloat-free table structures. > > There are two fundamental problems here. The first is that if, as you say, a table is progressively bloating and we are getting further and further behind on vacuuming and freezing, something is seriously wrong and we need to do something about it. In these cases, my experience is that vacuuming and related tools tend to suffer degraded performance, and determining how to solve the problem takes quite a bit more time than a routine bloat issue would. So what I am arguing against is treating the problem just as a bloat issue. If you get there due to vacuum being slow, something else is wrong and you are probably going to have to find and fix that as well in order to catch up. At least that's my experience. > > I don't object to the db continuing to run, allocate xids etc. What I object to is it doing so silently where things are almost certainly going very wrong. OK. My feeling is that the set of things we can do to warn the user is somewhat limited. I'm open to trying our best, but we need to have reasonable expectations. Sophisticated users will be monitoring for problems even if we do nothing to warn, and dumb ones won't look at their logs. Any feature that proposes to warn must aim at the users who are smart enough to check the logs but dumb enough not to have any more sophisticated monitoring. Such users certainly exist and are not even uncommon, but they aren't the only kind by a long shot. My argument is that removing xidStopLimit is totally fine, because it only serves to stop the database. What to do about xidWarnLimit is a slightly more complex question. Certainly it can't be left untouched, because warning that we're about to shut down the database for lack of allocatable XIDs is not sensible if there is no such lack and we aren't going to shut it down. But I'm also not sure if the model is right. Doing nothing for a long time and then warning in every transaction when some threshold is crossed is an extreme behavior change. Right now that's somewhat justified because we're about to hit a brick wall at full speed, but if we remove the brick wall and replace it with a gentle pelting with rotten eggs, it's unclear that a similarly strenuous reaction is the right model. But that's also not to say that we should do nothing at all. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Nov 29, 2022 at 11:41:07AM -0500, Robert Haas wrote: > My argument is that removing xidStopLimit is totally fine, because it > only serves to stop the database. What to do about xidWarnLimit is a > slightly more complex question. Certainly it can't be left untouched, > because warning that we're about to shut down the database for lack of > allocatable XIDs is not sensible if there is no such lack and we > aren't going to shut it down. But I'm also not sure if the model is > right. Doing nothing for a long time and then warning in every > transaction when some threshold is crossed is an extreme behavior > change. Right now that's somewhat justified because we're about to hit > a brick wall at full speed, but if we remove the brick wall and > replace it with a gentle pelting with rotten eggs, it's unclear that a > similarly strenuous reaction is the right model. But that's also not > to say that we should do nothing at all. Yeah, we would probably need to warn on every 1 million transactions or something. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Embrace your flaws. They make you human, rather than perfect, which you will never be.
On Mon, 28 Nov 2022 at 16:53, Peter Geoghegan <pg@bowt.ie> wrote: > > Imagine if we actually had 64-bit XIDs -- let's assume for a moment > that it's a done deal. This raises a somewhat awkward question: do you > just let the system get further and further behind on freezing, > forever? We can all agree that 2 billion XIDs is very likely the wrong > time to start refusing new XIDs -- because it just isn't designed with > debt in mind. But what's the right time, if any? How much debt is too > much? My first thought was... why not? Just let the system get further and further behind on freezing. Where's the harm? Picture an insert-only database that is receiving data very quickly, never having data deleted or modified. vacuum takes several days to complete and the system wraps the 32-bit XID space several times a day. The DBA asks you why they are even bothering to run vacuum. They have plenty of storage for clog, latency on selects is not a pain point, not compared to running multi-day vacuums that impact insert times.... That isn't far off the scenario where I've seen wraparound being a pain btw. Anti-wraparound vacuum took about 2 days and was kicking off pretty much as soon as the previous one finished. For a table that was mostly read-only. Of course, to make that judgement the DBA needs to have good ways to measure the space usage of clog, and the overhead caused by clog lookups that could be avoided. Then they can judge for themselves how much freezing is appropriate. -- greg
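As a rough yardstick for that judgement, some back-of-the-envelope arithmetic (assuming pg_xact's usual 2 status bits per transaction; the figures below are illustrative only, not from the thread):

```
/* Illustrative arithmetic: rough pg_xact size as a function of XID
 * consumption, assuming 2 commit-status bits per transaction. */
#include <stdio.h>

int
main(void)
{
	const double bytes_per_xact = 2.0 / 8.0;	/* 2 bits = 0.25 bytes */

	/* 2^31 XIDs -> 512 MiB, 2^40 XIDs -> 256 GiB of commit-status data */
	printf("2^31 XIDs: %.0f MiB of pg_xact\n",
		   2147483648.0 * bytes_per_xact / (1024.0 * 1024.0));
	printf("2^40 XIDs: %.0f GiB of pg_xact\n",
		   1099511627776.0 * bytes_per_xact / (1024.0 * 1024.0 * 1024.0));
	return 0;
}
```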
> On Tue, Nov 29, 2022 at 11:41:07AM -0500, Robert Haas wrote:
> > My argument is that removing xidStopLimit is totally fine, because it
> > only serves to stop the database. What to do about xidWarnLimit is a
> > slightly more complex question. Certainly it can't be left untouched,
> > because warning that we're about to shut down the database for lack of
> > allocatable XIDs is not sensible if there is no such lack and we
> > aren't going to shut it down. But I'm also not sure if the model is
> > right. Doing nothing for a long time and then warning in every
> > transaction when some threshold is crossed is an extreme behavior
> > change. Right now that's somewhat justified because we're about to hit
> > a brick wall at full speed, but if we remove the brick wall and
> > replace it with a gentle pelting with rotten eggs, it's unclear that a
> > similarly strenuous reaction is the right model. But that's also not
> > to say that we should do nothing at all.
> Yeah, we would probably need to warn on every 1 million transactions or
> something.
My proposal would be to make the threshold configurable and start warning on every transaction after that. There are a couple reasons to do that.
The first is that noisy warnings are extremely easy to see. You get them in cron emails, from psql, in the db logs etc. Having them every million makes them harder to catch.
The point here is not to ensure there are no problems, but to make sure that an existing layer in the current swiss cheese model of safety doesn't go away. Will it stop all problems? No. But the current warning strategy is effective, given how many times we hear of cases of people having to take drastic action to avoid impending xid wraparound.
If someone has an insert only database and maybe doesn't want to ever freeze, they can set the threshold to -1 or something. I would suggest keeping the default at 2 billion to be in line with existing limitations and practices. People can then adjust as they see fit.
Warning text might be something like "XID Lag Threshold Exceeded. Is autovacuum clearing space and keeping up?"
> --
> Bruce Momjian <bruce@momjian.us> https://momjian.us
> EDB https://enterprisedb.com
> Embrace your flaws. They make you human, rather than perfect,
> which you will never be.
On Tue, Nov 29, 2022 at 9:35 PM Chris Travers <chris@orioledata.com> wrote: > My proposal would be to make the threshold configurable and start warning on every transaction after that. There are a couple reasons to do that. > > The first is that noisy warnings are extremely easy to see. You get them in cron emails, from psql, in the db logs etc. Having them every million makes them harder to catch. > > The point here is not to ensure there are no problems, but to make sure that an existing layer in the current swiss cheese model of safety doesn't go away. Will it stop all problems? No. But the current warning strategy is effective, given how many times we hear of cases of people having to take drastic action to avoid impending xid wraparound. > > If someone has an insert only database and maybe doesn't want to ever freeze, they can set the threshold to -1 or something. I would suggest keeping the default at 2 billion to be in line with existing limitations and practices. People can then adjust as they see fit. > > Warning text might be something like "XID Lag Threshold Exceeded. Is autovacuum clearing space and keeping up?" None of this seems unreasonable to me. If we want to allow more configurability, we could also let you choose the threshold and the frequency of warnings (every N transactions). But, I think we might be getting down a little bit in the weeds. It's not clear that everybody's on board with the proposed page format changes. I'm not completely opposed, but I'm also not wild about the approach. It's probably not a good idea to spend all of our energy debating the details of how to reform xidWrapLimit without having some consensus on those points. It is, in a word, bikeshedding: on-disk page format changes are hard, but everyone understands warning messages. Lest we miss the forest for the trees, there is an aspect of this patch that I find to be an extremely good idea and think we should try to get committed even if the rest of the patch set ends up in the rubbish bin. Specifically, there are a couple of patches in here that have to do with making SLRUs indexed by 64-bit integers rather than by 32-bit integers. We've had repeated bugs in the area of handling SLRU wraparound in the past, some of which have caused data loss. Just by chance, I ran across a situation just yesterday where an SLRU wrapped around on disk for reasons that I don't really understand yet and chaos ensued. Switching to an indexing system for SLRUs that does not ever wrap around would probably enable us to get rid of a whole bunch of crufty code, and would also likely improve the general reliability of the system in situations where wraparound is threatened. It seems like a really, really good idea. I haven't checked the patches to see whether they look correct, and I'm concerned in particular about upgrade scenarios. But if there's a way we can get that part committed, I think it would be a clear win. -- Robert Haas EDB: http://www.enterprisedb.com
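A minimal sketch of the kind of configurable warning being discussed (the GUC names, defaults, and the warn-then-repeat policy are hypothetical, and it assumes the check runs once per XID allocation):

```
/* Sketch only: a configurable "XID lag" warning with a repeat interval.
 * xid_lag_warn_threshold and xid_lag_warn_interval are hypothetical names,
 * not anything from the patch set. */
#include <stdint.h>
#include <stdio.h>

static int64_t xid_lag_warn_threshold = 2000000000;	/* -1 disables the warning */
static int64_t xid_lag_warn_interval = 1000000;		/* repeat every 1 million XIDs */

static void
maybe_warn_xid_lag(uint64_t next_full_xid, uint64_t oldest_datfrozenxid)
{
	int64_t		lag = (int64_t) (next_full_xid - oldest_datfrozenxid);

	if (xid_lag_warn_threshold < 0 || lag < xid_lag_warn_threshold)
		return;

	/* assuming this runs once per allocated XID, warn at the threshold and
	 * then once every xid_lag_warn_interval XIDs thereafter */
	if ((lag - xid_lag_warn_threshold) % xid_lag_warn_interval == 0)
		fprintf(stderr,
				"WARNING: XID lag threshold exceeded (%lld). Is autovacuum keeping up?\n",
				(long long) lag);
}
```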
On Wed, Nov 30, 2022 at 8:13 AM Robert Haas <robertmhaas@gmail.com> wrote: > I haven't checked the patches to see whether they look correct, and > I'm concerned in particular about upgrade scenarios. But if there's a > way we can get that part committed, I think it would be a clear win. +1 -- Peter Geoghegan
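For readers unfamiliar with why non-wrapping page numbers simplify SLRU code, a small sketch (not code from the patch set) of the difference in ordering logic:

```
/* Illustration: "which page is older?" for wrapping 32-bit SLRU page numbers
 * versus monotonically increasing 64-bit ones. */
#include <stdbool.h>
#include <stdint.h>

static bool
page32_precedes(uint32_t a, uint32_t b)
{
	/* circular comparison; only valid while the two pages stay within half
	 * the 32-bit range of each other, which is where the cruft comes from */
	return (int32_t) (a - b) < 0;
}

static bool
page64_precedes(uint64_t a, uint64_t b)
{
	/* no wraparound to reason about */
	return a < b;
}
```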
Hi, Robert! > Lest we miss the forest for the trees, there is an aspect of this > patch that I find to be an extremely good idea and think we should try > to get committed even if the rest of the patch set ends up in the > rubbish bin. Specifically, there are a couple of patches in here that > have to do with making SLRUs indexed by 64-bit integers rather than by > 32-bit integers. We've had repeated bugs in the area of handling SLRU > wraparound in the past, some of which have caused data loss. Just by > chance, I ran across a situation just yesterday where an SLRU wrapped > around on disk for reasons that I don't really understand yet and > chaos ensued. Switching to an indexing system for SLRUs that does not > ever wrap around would probably enable us to get rid of a whole bunch > of crufty code, and would also likely improve the general reliability > of the system in situations where wraparound is threatened. It seems > like a really, really good idea. I totally support the idea that the part related to SLRU is worth committing, whether as the first step to 64xid or separately. This subset is discussed in a separate thread [1]. It seems that we need more time to reach a consensus on the implementation of the whole big thing. This discussion alone is complicated and reveals many different aspects concurrently in one thread. So I'd vote for an evolutionary approach and give my +1 for undertaking efforts to first commit [1] to 16. [1]: https://www.postgresql.org/message-id/CAFiTN-uudj2PY8GsUzFtLYFpBoq_rKegW3On_8ZHdxB1mVv3-A%40mail.gmail.com Kind regards, Pavel Borisov, Supabase.
Hi Maxim, > Anyway. Let's discuss on-disk page format, shall we? Here are my two cents. > AFAICS, we have a following options: > [...] > 2. Put special in every page where base for XIDs are stored. This is what we have done in the current patch set. The approach of using special space IMO is fine. I'm still a bit sceptical about the need to introduce a new entity "64-bit base XID" while we already have 32-bit XID epochs that will do the job. I suspect that having fewer entities helps to reason about the code and that it is important in the long run, but maybe it's just me. In any case, I don't have a strong opinion here. Additionally, I think we should be clear about the long term goals. As Peter G. pointed out above: > I think that we'll be able to get rid of freezing in a few years time. IMO eventually getting rid of freezing and "magic" XIDs will simplify the maintenance of the project and also make the behaviour of the system much more predictable. The user will have to worry only about the disk space reclamation. If we have a consensus that this is the final goal then we should definitely be moving toward 64-bit XIDs and perhaps even include a corresponding PoC to the patchset. If we want to keep freezing indefinitely then, as Chris Travers argued, 64-bit XIDs don't bring that much value and maybe the community should be focusing on something else. -- Best regards, Aleksander Alekseev
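A minimal sketch of the two reconstruction schemes being compared here (the helper names are illustrative, not taken from the patch set):

```
/* Illustration: two ways to widen an on-disk 32-bit value into a 64-bit XID. */
#include <stdint.h>

/* "page base" scheme: the tuple holds a 32-bit offset from a per-page base */
static inline uint64_t
xid_from_page_base(uint64_t page_xid_base, uint32_t short_xid)
{
	return page_xid_base + short_xid;
}

/* "epoch" scheme: a 32-bit epoch plus the 32-bit XID form the full value */
static inline uint64_t
xid_from_epoch(uint32_t epoch, uint32_t xid)
{
	return ((uint64_t) epoch << 32) | xid;
}
```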
This patch hasn't applied in quite some time, and the thread has moved to discussing higher level items rather than the suggested patch, so I'm closing this as Returned with Feedback. Please feel free to resubmit when there is renewed interest and a consensus on how/what to proceed with. -- Daniel Gustafsson
Hi, > This patch hasn't applied in quite some time, and the thread has moved to > discussing higher level items rather than the suggested patch, so I'm closing > this as Returned with Feedback. Please feel free to resubmit when there is > renewed interest and a consensus on how/what to proceed with. Yes, this thread awaits several other patches to be merged [1] in order to continue, so it makes sense to mark it as RwF for the time being. Thanks! [1]: https://commitfest.postgresql.org/43/3489/ -- Best regards, Aleksander Alekseev
--
Attachment
- v52-0002-Use-64-bit-FullTransactionId-instead-of-Epoch-xi.patch
- v52-0001-Use-64-bit-format-to-output-XIDs.patch
- v52-0004-Use-64-bit-GUCs.patch
- v52-0003-Use-64-bit-SLRU-pages-in-callers.patch
- v52-0005-Use-64-bit-XIDs.patch
- v52-0006-Add-initdb-option-to-initialize-cluster-with-non.patch
- v52-0007-README.XID64.patch
Hi! Just to keep this thread up to date, here's a new version after recent changes in SLRU. I've also changed the order of the patches in the set, to make adding initdb MOX options come after the "core 64 xid" patch, since the MOX patch is unlikely to be committed and is now for test purposes only.
--
Best regards,
Maxim Orlov.
Hi Maxim Orlov,
Good news: xid64 has achieved a successful first phase. I tried to change the patch status (https://commitfest.postgresql.org/43/3594/), but it seems incorrect.
Maxim Orlov <orlovmg@gmail.com> wrote on Wed, Dec 13, 2023 at 20:26:
> Hi! Just to keep this thread up to date, here's a new version after recent changes in SLRU. I've also changed the order of the patches in the set, to make adding initdb MOX options come after the "core 64 xid" patch, since the MOX patch is unlikely to be committed and is now for test purposes only.
Many thanks
Best wishes
Hi, Wenhui!
On Fri, 15 Dec 2023 at 05:52, wenhui qiu <qiuwenhuifx@gmail.com> wrote:
> Hi Maxim Orlov,
> Good news: xid64 has achieved a successful first phase. I tried to change the patch status (https://commitfest.postgresql.org/43/3594/), but it seems incorrect.
> Maxim Orlov <orlovmg@gmail.com> wrote on Wed, Dec 13, 2023 at 20:26:
> > Hi! Just to keep this thread up to date, here's a new version after recent changes in SLRU. I've also changed the order of the patches in the set, to make adding initdb MOX options come after the "core 64 xid" patch, since the MOX patch is unlikely to be committed and is now for test purposes only.
If the patch is RwF the CF entry is finished and can't be enabled; rather, the patch needs to be submitted in a new entry, which I have just done.
Please feel free to submit your review.
Kind regards,
Pavel Borisov,
Supabase
On Wed, Dec 13, 2023 at 5:56 PM Maxim Orlov <orlovmg@gmail.com> wrote: > > Hi! > > Just to keep this thread up to date, here's a new version after recent changes in SLRU. > I'm also change order of the patches in the set, to make adding initdb MOX options after the > "core 64 xid" patch, since MOX patch is unlikely to be committed and now for test purpose only. I tried to apply the patch but it is failing at the Head. It is giving the following error: Hunk #1 succeeded at 601 (offset 5 lines). patching file src/backend/replication/slot.c patching file src/backend/replication/walreceiver.c patching file src/backend/replication/walsender.c Hunk #1 succeeded at 2434 (offset 160 lines). patching file src/backend/storage/ipc/procarray.c Hunk #1 succeeded at 1115 with fuzz 2. Hunk #3 succeeded at 1286 with fuzz 2. Hunk #7 FAILED at 4341. Hunk #8 FAILED at 4899. Hunk #9 FAILED at 4959. 3 out of 10 hunks FAILED -- saving rejects to file src/backend/storage/ipc/procarray.c.rej patching file src/backend/storage/ipc/standby.c Hunk #1 FAILED at 1043. Hunk #2 FAILED at 1370. 2 out of 2 hunks FAILED -- saving rejects to file src/backend/storage/ipc/standby.c.rej Please send the Re-base version of the patch. Thanks and Regards, Shubham Khanna.
Hi, > Please send the Re-base version of the patch. PFA the rebased patchset. In order to keep the scope reasonable I suggest we focus on 0001...0005 for now. 0006+ are difficult to rebase / review and I'm a bit worried for the committer who will merge them. We can return to 0006+ when we deal with the first 5 patches. These patches can be delivered to PG18 independently, as we did with SLRU in the PG17 cycle. Tested on Intel MacOS w/ Autotools and ARM Linux w/ Meson. -- Best regards, Aleksander Alekseev
Attachment
Hi, > > Please send the Re-base version of the patch. > > PFA the rebased patchset. > > In order to keep the scope reasonable I suggest we focus on > 0001...0005 for now. 0006+ are difficult to rebase / review and I'm a > bit worried for the committer who will merge them. We can return to > 0006+ when we deal with the first 5 patches. These patches can be > delivered to PG18 independently, as we did with SLRU in the PG17 > cycle. > > Tested on Intel MacOS w/ Autotools and ARM Linux w/ Meson. cfbot revealed a bug in 0004: ``` ../src/backend/access/common/reloptions.c:1842:26: runtime error: store to misaligned address 0x55a5c6d64d94 for type 'int64', which requires 8 byte alignment 0x55a5c6d64d94: note: pointer points here ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ff ff ff ff ^ ==18945==Using libbacktrace symbolizer. #0 0x55a5c4a9b273 in fillRelOptions ../src/backend/access/common/reloptions.c:1842 #1 0x55a5c4a9c709 in build_reloptions ../src/backend/access/common/reloptions.c:2002 #2 0x55a5c4a9c739 in default_reloptions ../src/backend/access/common/reloptions.c:1957 #3 0x55a5c4a9ccc0 in heap_reloptions ../src/backend/access/common/reloptions.c:2109 #4 0x55a5c4da98d6 in DefineRelation ../src/backend/commands/tablecmds.c:858 #5 0x55a5c52549ac in ProcessUtilitySlow ../src/backend/tcop/utility.c:1164 #6 0x55a5c52545a2 in standard_ProcessUtility ../src/backend/tcop/utility.c:1067 #7 0x55a5c52546fd in ProcessUtility ../src/backend/tcop/utility.c:523 #8 0x55a5c524fe5b in PortalRunUtility ../src/backend/tcop/pquery.c:1158 #9 0x55a5c5250531 in PortalRunMulti ../src/backend/tcop/pquery.c:1315 #10 0x55a5c5250bd6 in PortalRun ../src/backend/tcop/pquery.c:791 #11 0x55a5c5249e44 in exec_simple_query ../src/backend/tcop/postgres.c:1274 #12 0x55a5c524cc07 in PostgresMain ../src/backend/tcop/postgres.c:4680 #13 0x55a5c524d18c in PostgresSingleUserMain ../src/backend/tcop/postgres.c:4136 #14 0x55a5c4f155e2 in main ../src/backend/main/main.c:194 #15 0x7fc3a77e5d09 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x23d09) #16 0x55a5c4a6a249 in _start (/tmp/cirrus-ci-build/build/tmp_install/usr/local/pgsql/bin/postgres+0x8e1249) Aborted (core dumped) ``` Here is the fix. It can be tested like this: ``` --- a/src/backend/access/common/reloptions.c +++ b/src/backend/access/common/reloptions.c @@ -1839,6 +1839,7 @@ fillRelOptions(void *rdopts, Size basesize, ((relopt_int *) options[i].gen)->default_val; break; case RELOPT_TYPE_INT64: + Assert((((uint64)itempos) & 0x7) == 0); *(int64 *) itempos = options[i].isset ? options[i].values.int64_val : ((relopt_int64 *) options[i].gen)->default_val; ``` -- Best regards, Aleksander Alekseev
Attachment
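For context on the sanitizer report above, a standalone sketch of the alignment rule involved (this is an illustration, not the actual reloptions fix; the macro and helper are hypothetical):

```
/* Illustration: a direct "*(int64 *) ptr = val" store requires ptr to be
 * 8-byte aligned, so the field offset must be rounded up to the type's
 * alignment first, in the style of PostgreSQL's TYPEALIGN macro. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ALIGN_UP(alignval, len) \
	(((uintptr_t) (len) + ((alignval) - 1)) & ~((uintptr_t) ((alignval) - 1)))

static void
store_int64_at(char *base, size_t offset, int64_t value)
{
	size_t		aligned = ALIGN_UP(sizeof(int64_t), offset);

	/* assuming base itself is suitably aligned, base + aligned is now safe
	 * to cast; the Assert mirrors the check added in the fix above */
	assert(((uintptr_t) (base + aligned)) % sizeof(int64_t) == 0);
	*(int64_t *) (base + aligned) = value;
}
```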
Hi, > Here is the fix. It can be tested like this: > [...] PFA the rebased patchset. -- Best regards, Aleksander Alekseev
Attachment
On 23.07.24 11:13, Aleksander Alekseev wrote: >> Here is the fix. It can be tested like this: >> [...] > > PFA the rebased patchset. I'm wondering about the 64-bit GUCs. At first, it makes sense that if there are settings that are counted in terms of transactions, and transaction numbers are 64-bit integers, then those settings should accept 64-bit integers. But what is the purpose and effect of setting those parameters to such huge numbers? For example, what is the usability of being able to set vacuum_failsafe_age = 500000000000 I think in the world of 32-bit transaction IDs, you can intuitively interpret most of these "transaction age" settings as "percent toward disaster". For example, vacuum_freeze_table_age = 150000000 is 7% toward disaster, and vacuum_failsafe_age = 1600000000 is 75% toward disaster. However, if there is no more disaster threshold at 2^31, what is the guidance for setting these? Or more radically, why even run transaction-count-based vacuum at all? Conversely, if there is still some threshold (not disaster, but efficiency or something else), would it still be useful to keep these settings well below 2^31? In which case, we might not need 64-bit GUCs. Your 0004 patch adds support for 64-bit GUCs but doesn't actually convert any existing GUCs to use that. (Unlike the reloptions, which your patch converts.) And so there is no documentation about these questions.
On 25/07/2024 13:19, Peter Eisentraut wrote: > I'm wondering about the 64-bit GUCs. > > At first, it makes sense that if there are settings that are counted in > terms of transactions, and transaction numbers are 64-bit integers, then > those settings should accept 64-bit integers. > > But what is the purpose and effect of setting those parameters to such > huge numbers? For example, what is the usability of being able to set > > vacuum_failsafe_age = 500000000000 > > I think in the world of 32-bit transaction IDs, you can intuitively > interpret most of these "transaction age" settings as "percent toward > disaster". For example, > > vacuum_freeze_table_age = 150000000 > > is 7% toward disaster, and > > vacuum_failsafe_age = 1600000000 > > is 75% toward disaster. > > However, if there is no more disaster threshold at 2^31, what is the > guidance for setting these? Or more radically, why even run > transaction-count-based vacuum at all? To allow the CLOG to be truncated. There's no disaster anymore, but without freezing, the clog will grow indefinitely. > Conversely, if there is still some threshold (not disaster, but > efficiency or something else), would it still be useful to keep these > settings well below 2^31? In which case, we might not need 64-bit GUCs. Yeah, I don't think it's critical. It makes sense to switch to 64-bit GUCs, so that you can make those settings higher, but it's not critical or strictly required for the rest of the work. Another approach is to make the GUCs mean "thousands of XIDs", similar to how many of our memory settings are in kB rather than bytes. That might be a super confusing change for existing settings though. -- Heikki Linnakangas Neon (https://neon.tech)
On 25.07.24 13:09, Heikki Linnakangas wrote: >> However, if there is no more disaster threshold at 2^31, what is the >> guidance for setting these? Or more radically, why even run >> transaction-count-based vacuum at all? > > To allow the CLOG to be truncated. There's no disaster anymore, but > without freezing, the clog will grow indefinitely. Maybe a setting similar to max_wal_size could be better for that?
+1
Thanks
> On 25.07.24 13:09, Heikki Linnakangas wrote:
> >> However, if there is no more disaster threshold at 2^31, what is the
> >> guidance for setting these? Or more radically, why even run
> >> transaction-count-based vacuum at all?
> >
> > To allow the CLOG to be truncated. There's no disaster anymore, but
> > without freezing, the clog will grow indefinitely.
> Maybe a setting similar to max_wal_size could be better for that?
--
Michael, Maxim, > Apparently, the original thread will inevitably disintegrate into many separate ones. > For me, looks like some kind of road map. One for 64-bit GUCs, another one to remove > short file names from SLRUs and, to make things more complicated, [1] for cf entry [0], > to get rid of MultiXactOffset wraparound by switching to 64 bits. > > [0] https://commitfest.postgresql.org/49/5205/ > [1] https://www.postgresql.org/message-id/flat/CACG%3DezaWg7_nt-8ey4aKv2w9LcuLthHknwCawmBgEeTnJrJTcw%40mail.gmail.com Agree. To me it seems like we didn't reach a consensus regarding switching to 64-bit XIDs. Given that and the fact that the patchset is rather difficult to rebase (and review) I suggest focusing on something we reached a consensus for. I'm going to close a CF entry for this particular thread as RwF unless anyone objects. We can always return to this later, preferably knowing that there is a particular committer who has time and energy for merging this. -- Best regards, Aleksander Alekseev
Hi, > Agree. > > To me it seems like we didn't reach a consensus regarding switching to > 64-bit XIDs. Given that and the fact that the patchset is rather > difficult to rebase (and review) I suggest focusing on something we > reached a consensus for. I'm going to close a CF entry for this > particular thread as RwF unless anyone objects. We can always return > to this later, preferably knowing that there is a particular committer > who has time and energy for merging this. I started a new thread and opened a new CF entry for Int64 GUCs: https://commitfest.postgresql.org/50/5253/ -- Best regards, Aleksander Alekseev
The v56 patch version includes redundant additions which are not quite
relevant to the xid64. The v57 version is refined.
Best regards,
Evgeny Voropaev,
Tantor Labs, LLC.
<evorop@gmail.com>
<evgeny.voropaev@tantorlabs.ru>
> Hi Evgeny
> The xid64 patch has split into several threads. The current one should be
> this: (https://www.postgresql.org/message-id/flat/CACG=ezaWg7_nt-8ey4aKv2w9LcuLthHknwCawmBgEeTnJrJTcw@mail.gmail.com)
> We can do some tests on the patch so that it can be merged earlier.
> Thanks
Hello Wenhui!
Thank you for pointing out the specific topic developing the Multixact
ID Offset MXIDOFF64 solution. I agree to the fullest extent with the
community's decision to split the xid64 into smaller parts and
subsequently merge them into the master. In the meantime, abandoning the
full version of the original patch and developing only its fragments, we
are causing the target solution to become stale. That is why I have
tried to put the whole solution together. The xid64 patch (v56-v58)
includes the MXIDOFF patch as distinct files. The first four
patch-files of the xid64 are exactly the 9th version of the MXIDOFF64.
If new versions of the xid64 patch update the parts relevant to
MXIDOFF64, we will add them into the MXIDOFF64 thread as its next version. I
also hope you and the community will support attempts to retain the
whole xid64 solution in a consistent state, because only a consistent
state allows us to continually test and fix the xid64 solution.
Best regards,
Evgeny Voropaev,
Tantor Labs LLC.