Thread: Proposal to add page headers to SLRU pages
Hi all,
PostgreSQL currently maintains several data structures in the SLRU cache. The current SLRU pages do not have any header, so it is impossible to checksum a page and verify its integrity, and issues caused by corrupted SLRU pages are very difficult to debug. Also, without a page header, the page LSN is tracked in an ad-hoc fashion using LSN groups, which requires an additional data structure in shared memory.

At eBay, we are building on the patch shared by Rishu Bagga in [1], which adds the standard PageHeaderData to each SLRU page. We believe that adding the standard page header to each SLRU page is the correct approach for the long run. It adds a checksum to each SLRU page, tracks page LSN as if it is a standard page and eases future page enhancements.
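For reference, PageHeaderData is the same header that already starts every heap and index page (reproduced here from src/include/storage/bufpage.h, comments shortened); the patch reserves the corresponding SizeOfPageHeaderData bytes at the start of every SLRU page:

typedef struct PageHeaderData
{
	PageXLogRecPtr pd_lsn;		/* LSN: next byte after last byte of xlog
								 * record for last change to this page */
	uint16		pd_checksum;	/* checksum */
	uint16		pd_flags;		/* flag bits */
	LocationIndex pd_lower;		/* offset to start of free space */
	LocationIndex pd_upper;		/* offset to end of free space */
	LocationIndex pd_special;	/* offset to start of special space */
	uint16		pd_pagesize_version;
	TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */
	ItemIdData	pd_linp[FLEXIBLE_ARRAY_MEMBER]; /* line pointer array */
} PageHeaderData;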
The enclosed patch changes the address calculation logic for all 7 SLRUs in the following 6 files (a sketch of the adjusted address math follows the list):
src/backend/access/transam/clog.c
src/backend/access/transam/commit_ts.c
src/backend/access/transam/multixact.c
src/backend/access/transam/subtrans.c
src/backend/commands/async.c
src/backend/storage/lmgr/predicate.c
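To give a flavor of this change, here is the existing address math from clog.c (PG 16), followed by what it becomes once SizeOfPageHeaderData bytes at the start of each page are reserved for the header. The "with header" variants are an illustration of the idea only; the exact macros in the patch may differ.

/* Current (PG 16) clog.c: the whole 8kB page holds transaction status bits */
#define CLOG_BITS_PER_XACT	2
#define CLOG_XACTS_PER_BYTE 4
#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)

#define TransactionIdToPage(xid)	((xid) / (TransactionId) CLOG_XACTS_PER_PAGE)
#define TransactionIdToPgIndex(xid) ((xid) % (TransactionId) CLOG_XACTS_PER_PAGE)
#define TransactionIdToByte(xid)	(TransactionIdToPgIndex(xid) / CLOG_XACTS_PER_BYTE)

/* Illustrative "with header" counterparts: only the space after the header is payload */
#define CLOG_XACTS_PER_PAGE \
	((BLCKSZ - SizeOfPageHeaderData) * CLOG_XACTS_PER_BYTE)
#define TransactionIdToByte(xid) \
	(SizeOfPageHeaderData + TransactionIdToPgIndex(xid) / CLOG_XACTS_PER_BYTE)

Because fewer XIDs fit on a page, XID-to-page and XID-to-segment boundaries shift, which is what makes the pg_upgrade conversion discussed later in this thread necessary.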
The patch enables page checksums with changes to the following 2 files (a sketch of the checksum step follows the list):
src/backend/access/transam/slru.c
src/bin/pg_checksums/pg_checksums.c
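As a minimal sketch of the write-side checksum step, assuming SLRU pages are checksummed the same way relation pages are (the helper name SlruPageSetChecksum is hypothetical; the real change lives in slru.c's physical write path):

#include "access/xlog.h"			/* DataChecksumsEnabled() */
#include "storage/bufpage.h"		/* PageHeader */
#include "storage/checksum.h"		/* pg_checksum_page() */

/*
 * Hypothetical helper: stamp a checksum on an SLRU page just before it is
 * written out, mirroring PageSetChecksumInplace() for relation pages.  The
 * SLRU page number stands in for the block number mixed into the checksum.
 */
static void
SlruPageSetChecksum(char *page, int pageno)
{
	if (DataChecksumsEnabled())
		((PageHeader) page)->pd_checksum =
			pg_checksum_page(page, (BlockNumber) pageno);
}

The read side would do the mirror-image check after each physical read and report a checksum failure instead of silently using a corrupted page.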
The patch removes the group LSNs defined for each SLRU cache. See changes to:
src/include/access/slru.h
The patch adds a few helper macros in the following files:
src/backend/storage/page/bufpage.c
src/include/storage/bufpage.h
The patch updates some test cases:
src/bin/pg_resetwal/t/001_basic.pl
src/test/modules/test_slru/test_slru.c
I am still working on patching pg_upgrade. I would love to hear your thoughts on the idea and the current patch.
Discussed with: Anton Shyrabokau and Shawn Debnath
[1] https://www.postgresql.org/message-id/flat/EFAAC0BE-27E9-4186-B925-79B7C696D5AC%40amazon.com
Regards,
Yong
Attachment
Hi,

> +1 to the idea to protect SLRUs from corruption. I'm slightly leaning towards
> the idea of separating checksums from datapages, but anyway this checksums
> are better than no checksums.
>
> > On 7 Dec 2023, at 10:06, Li, Yong <yoli@ebay.com> wrote:
> > I am still working on patching the pg_upgrade. Just love to hear your
> > thoughts on the idea and the current patch.
>
> FWIW you can take upgrade code from this patch [0] doing all the same stuff :)

Sounds like a half-measure to me. If we really want to go down this rabbit
hole IMO SLRU should be moved to shared buffers as proposed elsewhere [1].

[1]: https://www.postgresql.org/message-id/EFAAC0BE-27E9-4186-B925-79B7C696D5AC@amazon.com

--
Best regards,
Aleksander Alekseev
>> Sounds like a half-measure to me. If we really want to go down this
>> rabbit hole IMO SLRU should be moved to shared buffers as proposed
>> elsewhere [1].
>
> Thread that I cited stopped in 2018 for this exact reason. 5 years ago.
> Is this argument still valid? Meanwhile checksums of buffer pages also
> reside on a page :)

I would love to have seen more progress on the set of threads that proposed
the page header and integration of SLRU into the buffer cache. The changes
were large, and unfortunately, as a result, they didn't get the detailed
review they needed. The complex nature of the feature allowed more branches
to be split from the main thread with alternative approaches. Although this
is great to see, it meant that the core requirements around LSN and checksum
tracking via page headers did not get into PG 16.

What is being proposed now is the simple and core functionality of introducing
page headers to SLRU pages while continuing to be in the SLRU cache. This
allows the whole project to be iterative and reviewers to better reason about
the smaller set of changes being introduced into the codebase.

Once the set of on-disk changes are in, we can follow up on optimizations.
It may be moving to buffer cache or reviewing Dilip's approach in [1], we
will have the option to be flexible in our approach.

[1] https://www.postgresql.org/message-id/flat/CAFiTN-vzDvNz=ExGXz6gdyjtzGixKSqs0mKHMmaQ8sOSEFZ33A@mail.gmail.com

Shawn
On Thu, Dec 7, 2023 at 1:28 PM Debnath, Shawn <sdn@ebay.com> wrote:
> What is being proposed now is the simple and core functionality of introducing
> page headers to SLRU pages while continuing to be in the SLRU cache. This
> allows the whole project to be iterative and reviewers to better reason about
> the smaller set of changes being introduced into the codebase.
>
> Once the set of on-disk changes are in, we can follow up on optimizations.
> It may be moving to buffer cache or reviewing Dilip's approach in [1], we
> will have the option to be flexible in our approach.

I basically agree with this. I don't think we should let the perfect be the
enemy of the good. Shooting down this patch because it doesn't do everything
that we want is a recipe for getting nothing done at all.

That said, I don't think that the original post on this thread provides a
sufficiently clear and detailed motivation for making this change. For this
to eventually be committed, it's going to need (among other things) a commit
message that articulates a convincing rationale for whatever changes it makes.
Here's what the original email said:

> It adds a checksum to each SLRU page, tracks page LSN as if it is a standard page and eases future page enhancements.

Of those three things, in my opinion, the first is good and the other two are
too vague. I assume that most people who would be likely to read a commit
message would understand the value of pages having checksums. But I can't
immediately think of what the value of tracking the page LSN as if it were a
standard page might be, so that probably needs more explanation. Similarly,
at least one or two of the future page enhancements that might be eased should
be spelled out, and/or the ways in which they would be made easier should be
articulated.

--
Robert Haas
EDB: http://www.enterprisedb.com
Given so many different approaches were discussed, I have started a wiki to record and coordinate all efforts towards SLRU improvements. The wiki provides a concise overview of all the ideas discussed and can serve as a portal for all historical discussions. Currently, the wiki summarizes four recent threads, ranging from the identifier format change, to the page header change, to moving SLRU into the main buffer pool, to reducing lock contention on SLRU latches. We can keep the patch-related discussions in this thread and use the wiki as a live document for larger-scale collaboration.
The wiki page is here: https://wiki.postgresql.org/wiki/SLRU_improvements
Regarding the benefits of this patch, here is a detailed explanation:
- Checksum is added to each page, allowing us to verify if a page has been corrupted when read from the disk.
- The ad-hoc LSN group structure is removed from the SLRU cache control data and is replaced by the page LSN in the page header. This allows us to use the same WAL protocol as used by pages in the main buffer pool: flush all redo logs up to the page LSN before flushing the page itself (see the sketch after this list). If we move SLRU caches into the main buffer pool, this change fits naturally.
- It leaves further optimizations open. We can continue to pursue the goal of moving SLRU into the main buffer pool, or we can follow the lock partition idea. This change by itself does not conflict with either proposal.
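As a minimal sketch of that write-ahead rule (illustrative only, not the patch's exact code in slru.c), the page LSN from the new header is all that is needed:

#include "access/xlog.h"
#include "storage/bufpage.h"

/*
 * Honor the WAL-before-data rule for an SLRU page the same way bufmgr.c
 * does for shared buffers: any WAL up to the page's LSN must be durable
 * before the page itself is written out.
 */
static void
FlushWALBeforeSlruWrite(Page page)
{
	XLogRecPtr	lsn = PageGetLSN(page);

	if (!XLogRecPtrIsInvalid(lsn))
		XLogFlush(lsn);

	/* ...the caller can now safely write the 8kB page to its segment file... */
}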
Also, the patch is now complete and is ready for review. All check-world tests, including the TAP tests, pass with this patch.
Regards,
Yong
From: Robert Haas <robertmhaas@gmail.com>
Date: Friday, December 8, 2023 at 03:51
To: Debnath, Shawn <sdn@ebay.com>
Cc: Andrey Borodin <x4mmm@yandex-team.ru>, PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>, Aleksander Alekseev <aleksander@timescale.com>, Li, Yong <yoli@ebay.com>, Shyrabokau, Anton <antons@ebay.com>, Bagga, Rishu <bagrishu@amazon.com>
Subject: Re: Proposal to add page headers to SLRU pages
On Thu, Dec 7, 2023 at 1:28 PM Debnath, Shawn <sdn@ebay.com> wrote:
> What is being proposed now is the simple and core functionality of introducing
> page headers to SLRU pages while continuing to be in the SLRU cache. This
> allows the whole project to be iterative and reviewers to better reason about
> the smaller set of changes being introduced into the codebase.
>
> Once the set of on-disk changes are in, we can follow up on optimizations.
> It may be moving to buffer cache or reviewing Dilip's approach in [1], we
> will have the option to be flexible in our approach.
I basically agree with this. I don't think we should let the perfect
be the enemy of the good. Shooting down this patch because it doesn't
do everything that we want is a recipe for getting nothing done at
all.
That said, I don't think that the original post on this thread
provides a sufficiently clear and detailed motivation for making this
change. For this to eventually be committed, it's going to need (among
other things) a commit message that articulates a convincing rationale
for whatever changes it makes. Here's what the original email said:
> It adds a checksum to each SLRU page, tracks page LSN as if it is a standard page and eases future page enhancements.
Of those three things, in my opinion, the first is good and the other
two are too vague. I assume that most people who would be likely to
read a commit message would understand the value of pages having
checksums. But I can't immediately think of what the value of tracking
the page LSN as if it were a standard page might be, so that probably
needs more explanation. Similarly, at least one or two of the future
page enhancements that might be eased should be spelled out, and/or
the ways in which they would be made easier should be articulated.
--
Robert Haas
EDB: http://www.enterprisedb.com
Attachment
On Thu, Dec 8, 2023 at 1:36 AM Li, Yong <yoli@ebay.com> wrote:
> Given so many different approaches were discussed, I have started a wiki to
> record and collaborate all efforts towards SLRU improvements. The wiki
> provides a concise overview of all the ideas discussed and can serve as a
> portal for all historical discussions. Currently, the wiki summarizes four
> recent threads ranging from identifier format change to page header change,
> to moving SLRU into the main buffer pool, to reduce lock contention on SLRU
> latches. We can keep the patch related discussions in this thread and use
> the wiki as a live document for larger scale collaborations.
>
> The wiki page is here: https://wiki.postgresql.org/wiki/SLRU_improvements
>
> Regarding the benefits of this patch, here is a detailed explanation:
>
> 1. Checksum is added to each page, allowing us to verify if a page has been
>    corrupted when read from the disk.
> 2. The ad-hoc LSN group structure is removed from the SLRU cache control data
>    and is replaced by the page LSN in the page header. This allows us to use
>    the same WAL protocol as used by pages in the main buffer pool: flush all
>    redo logs up to the page LSN before flushing the page itself. If we move
>    SLRU caches into the main buffer pool, this change fits naturally.
> 3. It leaves further optimizations open. We can continue to pursue the goal
>    of moving SLRU into the main buffer pool, or we can follow the lock
>    partition idea. This change by itself does not conflict with either proposal.
>
> Also, the patch is now complete and is ready for review. All check-world
> tests including tap tests passed with this patch.

Hi Yong,

I agree we should break the effort for the SLRU optimization into smaller
chunks after having worked on some of the bigger patches and facing difficulty
in making progress that way.

The patch looks mostly good to me; though one thing that I thought about
differently with the upgrade portion is where we should keep the logic of
re-writing the CLOG files. There is a precedent, introduced back in Postgres
9.6, for making on-disk page format changes across versions in the visibility
map: [1]

code comment:

 * In versions of PostgreSQL prior to catversion 201603011, PostgreSQL's
 * visibility map included one bit per heap page; it now includes two.
 * When upgrading a cluster from before that time to a current PostgreSQL
 * version, we could refuse to copy visibility maps from the old cluster
 * to the new cluster; the next VACUUM would recreate them, but at the
 * price of scanning the entire table. So, instead, we rewrite the old
 * visibility maps in the new format.

This work is being done in file.c – it seems to me the proper way to proceed
would be to continue writing on-disk upgrade logic here.

Besides that this looks good to me, would like to hear what others have to say.

Thanks,

Rishu Bagga
Amazon Web Services (AWS)

[1] https://github.com/postgres/postgres/commit/7087166a88fe0c04fc6636d0d6d6bea1737fc1fb
> This work is being done in file.c – it seems to me the proper way to
> proceed would be to continue writing on-disk upgrade logic here.
> Besides that this looks good to me, would like to hear what others have to say.

Thank you, Rishu, for taking the time to review the code. I've updated the
patch and moved the on-disk upgrade logic to pg_upgrade/file.c.

I have also added this thread to the current Commitfest and hope this patch
will be part of the 17 release.

The commitfest link: https://commitfest.postgresql.org/46/4709/

Regards,
Yong
Attachment
Hi,

> I have also added this thread to the current Commitfest and hope this patch
> will be part of the 17 release.
>
> The commitfest link:
> https://commitfest.postgresql.org/46/4709/

Thanks for the updated patch.

cfbot seems to have some complaints regarding compiler warnings and also
building the patch on Windows:

http://cfbot.cputube.org/

--
Best regards,
Aleksander Alekseev
> On Jan 2, 2024, at 19:35, Aleksander Alekseev <aleksander@timescale.com> wrote:
>
> Thanks for the updated patch.
>
> cfbot seems to have some complaints regarding compiler warnings and
> also building the patch on Windows:
>
> http://cfbot.cputube.org/

Thanks for the information. Here is the updated patch.

Regards,
Yong
Attachment
Regards,
Yong
Attachment
Regards,
Yong
Attachment
Hello,

I suppose this is important to do if we ever want to move SLRUs into shared
buffers. However, I wonder about the extra time this adds to pg_upgrade.
Is this something we should be concerned about? Is there any
measurement/estimates to tell us how long this would be? Right now, if you
use a cloning strategy for the data files, the upgrade should be pretty
quick ... but the amount of data in pg_xact and pg_multixact could be
massive, and the rewrite is likely to take considerable time.

--
Álvaro Herrera               PostgreSQL Developer — https://www.EnterpriseDB.com/
"Las cosas son buenas o malas segun las hace nuestra opinión" (Lisias)
Greetings,

* Alvaro Herrera (alvherre@alvh.no-ip.org) wrote:
> I suppose this is important to do if we ever want to move SLRUs into
> shared buffers. However, I wonder about the extra time this adds to
> pg_upgrade. Is this something we should be concerned about? Is there
> any measurement/estimates to tell us how long this would be? Right now,
> if you use a cloning strategy for the data files, the upgrade should be
> pretty quick ... but the amount of data in pg_xact and pg_multixact
> could be massive, and the rewrite is likely to take considerable time.

While I definitely agree that there should be some consideration of this
concern, it feels on-par with the visibility-map rewrite which was done
previously. Larger systems will likely have more to deal with than smaller
systems, but it's still a relatively small portion of the data overall.

The benefit of this change, beyond just the possibility of moving them into
shared buffers some day in the future, is that this would mean that SLRUs
will have checksums (if the cluster has them enabled). That benefit strikes
me as well worth the cost of the rewrite taking some time and the minor loss
of space due to the page header.

Would it be useful to consider parallelizing this work? There's already parts
of pg_upgrade which can be parallelized and so this isn't, hopefully, a big
lift to add, but I'm not sure if there's enough work being done here CPU-wise,
compared to the amount of IO being done, to have it make sense to run it in
parallel. Might be worth looking into though, at least, as disks have gotten
to be quite fast.

Thanks!

Stephen
Attachment
> On Mar 7, 2024, at 03:09, Stephen Frost <sfrost@snowman.net> wrote:
>
> * Alvaro Herrera (alvherre@alvh.no-ip.org) wrote:
>> I suppose this is important to do if we ever want to move SLRUs into
>> shared buffers. However, I wonder about the extra time this adds to
>> pg_upgrade. Is this something we should be concerned about? Is there
>> any measurement/estimates to tell us how long this would be? Right now,
>> if you use a cloning strategy for the data files, the upgrade should be
>> pretty quick ... but the amount of data in pg_xact and pg_multixact
>> could be massive, and the rewrite is likely to take considerable time.
>
> While I definitely agree that there should be some consideration of
> this concern, it feels on-par with the visibility-map rewrite which was
> done previously. Larger systems will likely have more to deal with than
> smaller systems, but it's still a relatively small portion of the data
> overall.
>
> The benefit of this change, beyond just the possibility of moving them
> into shared buffers some day in the future, is that this would mean that
> SLRUs will have checksums (if the cluster has them enabled). That
> benefit strikes me as well worth the cost of the rewrite taking some
> time and the minor loss of space due to the page header.
>
> Would it be useful to consider parallelizing this work? There's already
> parts of pg_upgrade which can be parallelized and so this isn't,
> hopefully, a big lift to add, but I'm not sure if there's enough work
> being done here CPU-wise, compared to the amount of IO being done, to
> have it make sense to run it in parallel. Might be worth looking into
> though, at least, as disks have gotten to be quite fast.
>
> Thanks!
>
> Stephen

Thank Alvaro and Stephen for your thoughtful comments.

I did a quick benchmark regarding pg_upgrade time, and here are the results.

Hardware spec: MacBook Pro M1 Max - 10 cores, 64GB memory, 1TB Apple SSD
Operating system: macOS 14.3.1
Compiler: Apple clang 15.0.0
Compiler optimization level: -O2

==== PG setups:
Old cluster: PG 16.2 release (source build)
New cluster: PG Git HEAD plus the patch (source build)

==== Benchmark steps:
1. Initdb for PG 16.2.
2. Initdb for PG HEAD.
3. Run pg_upgrade on the above empty database, and time the overall wall clock time.
4. In the old cluster, write 512MB all-zero dummy segment files (2048 segments) under pg_xact.
5. In the old cluster, write 512MB all-zero dummy segment files under pg_multixact/members.
6. In the old cluster, write 512MB all-zero dummy segment files under pg_multixact/offsets.
7. Purge the OS page cache.
8. Run pg_upgrade again, and time the overall wall clock time.

==== Test result:
On the empty database, pg_upgrade took 4.8 seconds to complete.
With 1.5GB combined SLRU data to convert, pg_upgrade took 11.5 seconds to complete.
It took 6.7 seconds to convert 1.5GB of SLRU files for pg_upgrade.

====
For clog, 2048 segments can host about 2 billion transactions, right at the
limit for wraparound. That's the maximum we can have. 2048 segments are also
big for the pg_multixact SLRUs.

Therefore, on modern hardware, in the worst case, pg_upgrade will run for
about 7 seconds longer.

Regards,
Yong
Greetings,

* Li, Yong (yoli@ebay.com) wrote:
> Thank Alvaro and Stephen for your thoughtful comments.
>
> I did a quick benchmark regarding pg_upgrade time, and here are the results.
>
> For clog, 2048 segments can host about 2 billion transactions, right at the
> limit for wraparound. That's the maximum we can have. 2048 segments are also
> big for the pg_multixact SLRUs.
>
> Therefore, on modern hardware, in the worst case, pg_upgrade will run for
> about 7 seconds longer.

Thanks for testing! That strikes me as perfectly reasonable and seems unlikely
that we'd get much benefit from parallelizing it, so I'd say it makes sense to
keep this code simple.

Thanks again!

Stephen
Attachment
On 2024-Mar-08, Stephen Frost wrote:

> Thanks for testing! That strikes me as perfectly reasonable and seems
> unlikely that we'd get much benefit from parallelizing it, so I'd say it
> makes sense to keep this code simple.

Okay, agreed, that amount of time sounds reasonable to me too; but I don't
want to be responsible for this at least for pg17. If some other committer
wants to take it, be my guest. However, I think this is mostly a foundation
for building other things on top, so committing during the last commitfest is
perhaps not very useful.

Another aspect of this patch is the removal of the LSN groups. There's an
explanation of the LSN groups in src/backend/access/transam/README, and while
this patch removes the LSN group feature, it doesn't update that text. That's
already a problem which needs to be fixed, but the text says

: In fact, we store more than one LSN for each clog page. This relates to
: the way we set transaction status hint bits during visibility tests.
: We must not set a transaction-committed hint bit on a relation page and
: have that record make it to disk prior to the WAL record of the commit.
: Since visibility tests are normally made while holding buffer share locks,
: we do not have the option of changing the page's LSN to guarantee WAL
: synchronization. Instead, we defer the setting of the hint bit if we have
: not yet flushed WAL as far as the LSN associated with the transaction.
: This requires tracking the LSN of each unflushed async commit.
: It is convenient to associate this data with clog buffers: because we
: will flush WAL before writing a clog page, we know that we do not need
: to remember a transaction's LSN longer than the clog page holding its
: commit status remains in memory. However, the naive approach of storing
: an LSN for each clog position is unattractive: the LSNs are 32x bigger
: than the two-bit commit status fields, and so we'd need 256K of
: additional shared memory for each 8K clog buffer page. We choose
: instead to store a smaller number of LSNs per page, where each LSN is
: the highest LSN associated with any transaction commit in a contiguous
: range of transaction IDs on that page. This saves storage at the price
: of some possibly-unnecessary delay in setting transaction hint bits.

In the new code we effectively store only one LSN per page, which I understand
is strictly worse. Maybe the idea of doing away with these LSN groups should
be reconsidered ... unless I completely misunderstand the whole thing.

--
Álvaro Herrera               PostgreSQL Developer — https://www.EnterpriseDB.com/
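For readers following along, this is the existing group-LSN machinery being discussed, as it stands in src/backend/access/transam/clog.c today (PG 16): each group of 32 consecutive XIDs on a clog page shares one LSN slot in shared memory, so every page tracks CLOG_LSNS_PER_PAGE LSNs rather than a single one.

#define CLOG_XACTS_PER_LSN_GROUP	32	/* keep this a power of 2 */
#define CLOG_LSNS_PER_PAGE	(CLOG_XACTS_PER_PAGE / CLOG_XACTS_PER_LSN_GROUP)

#define GetLSNIndex(slotno, xid)	((slotno) * CLOG_LSNS_PER_PAGE + \
	((xid) % (TransactionId) CLOG_XACTS_PER_PAGE) / CLOG_XACTS_PER_LSN_GROUP)

TransactionIdGetStatus() looks up group_lsn[GetLSNIndex(slotno, xid)] to decide whether the commit record for that XID has been flushed before a hint bit may be set.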
On Fri, 2024-03-08 at 13:58 +0100, Alvaro Herrera wrote:
> In the new code we effectively store only one LSN per page, which I
> understand is strictly worse.

To quote from the README:

"Instead, we defer the setting of the hint bit if we have not yet flushed WAL
as far as the LSN associated with the transaction. This requires tracking the
LSN of each unflushed async commit."

In other words, the problem the group_lsns are solving is that we can't set
the hint bit on a tuple until the commit record for that transaction has
actually been flushed. For ordinary sync commit, that's fine, because the
CLOG bit isn't set until after the commit record is flushed. But for async
commit, the CLOG may be updated before the WAL is flushed, and group_lsns are
one way to track the right information to hold off updating the hint bits.

"It is convenient to associate this data with clog buffers: because we will
flush WAL before writing a clog page, we know that we do not need to remember
a transaction's LSN longer than the clog page holding its commit status
remains in memory."

It's not clear to me that it is so convenient, if it's preventing the SLRU
from fitting in with the rest of the system architecture.

"The worst case is where a sync-commit xact shares a cached LSN with an
async-commit xact that commits a bit later; even though we paid to sync the
first xact to disk, we won't be able to hint its outputs until the second
xact is sync'd, up to three walwriter cycles later."

Perhaps we can revisit alternatives to the group_lsn? If we accept Yong's
proposal, and the SLRU has a normal LSN and was used in the normal way, we
would just need some kind of secondary structure to hold a mapping from
XID->LSN only for async transactions.

The characteristics we are looking for in this secondary structure are:

1. cheap to test if it's empty, so it doesn't interfere with a purely sync
   workload at all
2. expire old entries (where the LSN has already been flushed) cheaply enough
   so the data structure doesn't bloat
3. look up an LSN given an XID cheaply enough that it doesn't interfere with
   setting hint bits

Making a better secondary structure seems doable to me. Just to brainstorm:

* Have an open-addressed hash table mapping async XIDs to their commit LSN.
  If you have a hash collision, opportunistically see if the entry is old and
  can be removed. Try K probes, and if they are all recent, then you need to
  XLogFlush. The table could get pretty big, because it needs to hold enough
  async transactions for a wal writer cycle or two, but it seems reasonable
  to make async workloads pay that memory cost.

* Another idea, if the size of the structure is a problem, is to group K
  async xids into a bloom filter that points at a single LSN. When more
  transactions come along, create another bloom filter for the next K async
  xids. This could interfere with setting hint bits for sync xids if the
  bloom filters are undersized, but that doesn't seem like a big problem.

Regards,
	Jeff Davis
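To make the first brainstormed idea above concrete, here is a minimal sketch of such an open-addressed XID->LSN table. All names and sizes are made up for illustration, and locking/shared-memory placement is omitted; this is not code from the patch.

#include "access/transam.h"		/* TransactionId, InvalidTransactionId */
#include "access/xlog.h"		/* XLogRecPtr, GetFlushRecPtr() */

#define ASYNC_LSN_TABLE_SIZE	8192	/* slots (power of 2), sized for a walwriter cycle or two */
#define ASYNC_LSN_MAX_PROBES	4		/* the "K" probes mentioned above */

typedef struct AsyncLsnEntry
{
	TransactionId xid;			/* InvalidTransactionId means the slot is free */
	XLogRecPtr	lsn;			/* commit record location */
} AsyncLsnEntry;

static AsyncLsnEntry AsyncLsnTable[ASYNC_LSN_TABLE_SIZE];

/*
 * Remember the commit LSN of an async transaction.  Returns false if all
 * probed slots still hold unflushed entries; the caller would then
 * XLogFlush() before reusing one.
 */
static bool
AsyncLsnInsert(TransactionId xid, XLogRecPtr commit_lsn)
{
	uint32		pos = xid % ASYNC_LSN_TABLE_SIZE;
	XLogRecPtr	flushed = GetFlushRecPtr(NULL);

	for (int probe = 0; probe < ASYNC_LSN_MAX_PROBES; probe++)
	{
		AsyncLsnEntry *e = &AsyncLsnTable[(pos + probe) % ASYNC_LSN_TABLE_SIZE];

		/* Reuse a free slot, or expire an entry whose WAL is already durable. */
		if (e->xid == InvalidTransactionId || e->lsn <= flushed)
		{
			e->xid = xid;
			e->lsn = commit_lsn;
			return true;
		}
	}
	return false;
}

Looking up the LSN for a given XID before setting a hint bit would probe the same K slots; a miss would mean the commit record is already known to be flushed (or the commit was synchronous).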
On Fri, 2024-03-08 at 12:39 -0800, Jeff Davis wrote:
> Making a better secondary structure seems doable to me. Just to
> brainstorm:

We can also keep the group_lsns, and then just update both the CLOG page LSN
and the group_lsn when setting the transaction status. The former would be
used for all of the normal WAL-related stuff, and the latter would be used by
TransactionIdGetStatus() to return the more precise LSN for that group.

Regards,
	Jeff Davis
On Wed, 2024-03-06 at 12:01 +0000, Li, Yong wrote:
> Rebase the patch against the latest HEAD.

The upgrade logic could use more comments explaining what's going on and why.
As I understand it, it's a one-time conversion that needs to happen between 16
and 17. Is that right?

Was the way CLOG is upgraded already decided in some earlier discussion?

Given that the CLOG is append-only and gets truncated occasionally, I wonder
whether we can just have some marker that xids before some number are the old
CLOG, and xids beyond that number are in the new CLOG. I'm not necessarily
suggesting that; just an idea.

Regards,
	Jeff Davis
> On Mar 9, 2024, at 05:22, Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Wed, 2024-03-06 at 12:01 +0000, Li, Yong wrote:
>> Rebase the patch against the latest HEAD.
>
> The upgrade logic could use more comments explaining what's going on
> and why. As I understand it, it's a one-time conversion that needs to
> happen between 16 and 17. Is that right?
>
> Regards,
> Jeff Davis

> In the new code we effectively store only one LSN per page, which I
> understand is strictly worse. Maybe the idea of doing away with these
> LSN groups should be reconsidered ... unless I completely misunderstand
> the whole thing.
>
> --
> Álvaro Herrera PostgreSQL Developer —

Thanks for the comments on LSN groups and pg_upgrade. I have updated the patch
to address both comments:

- The clog LSN group has been brought back. Now the page LSN on each clog page
  is used for honoring the write-ahead rule, and it is always the highest LSN
  of all the LSN groups on the page. The LSN groups are used by
  TransactionIdGetStatus() as before.

- New comments have been added to pg_upgrade to mention the SLRU page header
  change as the reason for upgrading clog files.

Regards,
Yong
Attachment
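A minimal sketch of the "page LSN is the maximum of the group LSNs" behavior described above (illustrative only; the names and the exact call site in the patch may differ):

#include "access/slru.h"
#include "storage/bufpage.h"

/*
 * When recording the commit LSN for an XID, keep the per-group LSN exactly
 * as today, and additionally advance the page LSN in the new page header so
 * it always stays the maximum of all group LSNs on the page.
 */
static void
SlruRecordGroupLsn(SlruShared shared, Page page, int lsnindex, XLogRecPtr lsn)
{
	if (shared->group_lsn[lsnindex] < lsn)
		shared->group_lsn[lsnindex] = lsn;

	if (PageGetLSN(page) < lsn)
		PageSetLSN(page, lsn);
}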
On Mon, 2024-03-11 at 10:01 +0000, Li, Yong wrote:
> - The clog LSN group has been brought back.
>   Now the page LSN on each clog page is used for honoring the write-ahead rule
>   and it is always the highest LSN of all the LSN groups on the page.
>   The LSN groups are used by TransactionIdGetStatus() as before.

I like where this is going. Álvaro, do you still see a problem with this
approach?

> - New comments have been added to pg_upgrade to mention the SLRU
>   page header change as the reason for upgrading clog files.

That seems reasonable, but were any alternatives discussed? Do we have
consensus that this is the right thing to do?

And if we use this approach, is there extra validation or testing that can be
done?

Regards,
	Jeff Davis
>> - New comments have been added to pg_upgrade to mention the SLRU
>> page header change as the reason for upgrading clog files.
>
> That seems reasonable, but were any alternatives discussed? Do we have
> consensus that this is the right thing to do?

In general, there are two approaches. Either we convert the existing clog
files, or we don't. The patch chooses to convert.

If we don't convert, then the clog code must be able to handle both formats.
For XIDs in the range where the clog is written in the old format, segment and
offset computation must be done one way, and for XIDs in a different range, it
must be computed a different way. To avoid changing the format in the middle
of a page, which must not happen, the new format must start from a clean page,
possibly in a clean new segment. If the database is extremely small and has
only a few transactions on the first page of clog, then we must either convert
the whole page (effectively the whole clog file), or we must skip the rest of
the XIDs on the page and ask the database to start from XIDs on the second
page on restart. Also, we need to consider where to store the cut-off XID and
when to remove it. All these details feel very complex and error prone to me.
Performing a one-time conversion is the most efficient and straightforward
approach to me.

> And if we use this approach, is there extra validation or testing that
> can be done?
>
> Regards,
> Jeff Davis

Unfortunately, the test requires a setup of two different versions of PG. I am
not aware of an existing test infrastructure which can run automated tests
using two PGs. I did manually verify the output of pg_upgrade.

Regards,
Yong
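To make the one-time conversion concrete, here is a heavily simplified sketch of the rewrite step (all names are hypothetical; in the patch the real logic lives in pg_upgrade/file.c, next to the visibility-map rewrite mentioned earlier). Old-format pages carry BLCKSZ bytes of status data, while new-format pages reserve SizeOfPageHeaderData bytes for the header, so the payload is re-chunked across page boundaries.

#include <stdio.h>
#include <string.h>

#include "storage/bufpage.h"

/* usable payload per new-format SLRU page */
#define NEW_SLRU_PAYLOAD	(BLCKSZ - SizeOfPageHeaderData)

/* Write one new-format page whose payload is the given bytes (zero-padded). */
static int
write_new_slru_page(FILE *dst, const char *payload, size_t payload_len)
{
	char		newpage[BLCKSZ];
	PageHeaderData *hdr = (PageHeaderData *) newpage;

	memset(newpage, 0, BLCKSZ);
	hdr->pd_lower = SizeOfPageHeaderData;
	hdr->pd_upper = BLCKSZ;
	hdr->pd_special = BLCKSZ;
	hdr->pd_pagesize_version = BLCKSZ | PG_PAGE_LAYOUT_VERSION;
	memcpy(newpage + SizeOfPageHeaderData, payload, payload_len);

	/* a checksum would be stamped here if the cluster has checksums enabled */

	return fwrite(newpage, 1, BLCKSZ, dst) == BLCKSZ ? 0 : -1;
}

The caller would stream the old segment files as one contiguous run of status bytes and feed NEW_SLRU_PAYLOAD-sized chunks to this function, starting a new segment file every SLRU_PAGES_PER_SEGMENT pages.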
Hi,

On Tue, Mar 19, 2024 at 06:48:33AM +0000, Li, Yong wrote:
>
> Unfortunately, the test requires a setup of two different versions of PG. I am not
> aware of an existing test infrastructure which can run automated tests using two
> PGs. I did manually verify the output of pg_upgrade.

I think there is something in t/002_pg_upgrade.pl (see src/bin/pg_upgrade/TESTING),
that could be used to run automated tests using an old and a current version.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Mon, Jun 10, 2024 at 07:19:56AM +0000, Bertrand Drouvot wrote:
> On Tue, Mar 19, 2024 at 06:48:33AM +0000, Li, Yong wrote:
>> Unfortunately, the test requires a setup of two different versions of PG. I am not
>> aware of an existing test infrastructure which can run automated tests using two
>> PGs. I did manually verify the output of pg_upgrade.
>
> I think there is something in t/002_pg_upgrade.pl (see src/bin/pg_upgrade/TESTING),
> that could be used to run automated tests using an old and a current version.

Cluster.pm relies on install_path for stuff, where it is possible to create
tests with multiple nodes pointing to different installation paths. This
allows mixing nodes with different build options, or just different major
versions like pg_upgrade's perl tests.

--
Michael
Attachment
> On Jun 10, 2024, at 16:01, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Mon, Jun 10, 2024 at 07:19:56AM +0000, Bertrand Drouvot wrote:
>> On Tue, Mar 19, 2024 at 06:48:33AM +0000, Li, Yong wrote:
>>> Unfortunately, the test requires a setup of two different versions of PG. I am not
>>> aware of an existing test infrastructure which can run automated tests using two
>>> PGs. I did manually verify the output of pg_upgrade.
>>
>> I think there is something in t/002_pg_upgrade.pl (see src/bin/pg_upgrade/TESTING),
>> that could be used to run automated tests using an old and a current version.
>
> Cluster.pm relies on install_path for stuff, where it is possible to
> create tests with multiple nodes pointing to different installation
> paths. This allows mixing nodes with different build options, or just
> different major versions like pg_upgrade's perl tests.
> —
> Michael

Thanks for pointing this out. Here is what I have tried:

1. Manually build and install PostgreSQL from the latest source code.
2. Following the instructions from src/bin/pg_upgrade, manually dump the
   regression database.
3. Apply the patch to the latest code, and build from the source.
4. Run “make check” by following the instructions from src/bin/pg_upgrade,
   setting olddump and oldinstall to point to the “old” installation used in
   step 2.

All tests pass.

Yong
> Thanks for pointing this out. Here is what I have tried:
> 1. Manually build and install PostgreSQL from the latest source code.
> 2. Following the instructions from src/bin/pg_upgrade, manually dump the
>    regression database.
> 3. Apply the patch to the latest code, and build from the source.
> 4. Run “make check” by following the instructions from src/bin/pg_upgrade,
>    setting olddump and oldinstall to point to the “old” installation used in
>    step 2.
> All tests pass.

Hi all,

Following up on this. What remaining work do we need to do to get this in?

Thanks,

Rishu Bagga