Thread: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
Hi,

If vacuum fails to remove a tuple with xmax older than
VacuumCutoffs->OldestXmin and younger than GlobalVisState->maybe_needed,
it will ERROR out when determining whether or not to freeze the tuple
with "cannot freeze committed xmax".

In back branches starting with 14, failing to remove tuples older than
OldestXmin during pruning caused vacuum to infinitely loop in
lazy_scan_prune(), as investigated on this [1] thread. On master, after
1ccc1e05ae removed the retry loop in lazy_scan_prune() and stopped
comparing tuples to OldestXmin, the hang could no longer happen, but we
can still attempt to freeze dead tuples visibly killed before
OldestXmin -- resulting in an ERROR.

Pruning may fail to remove dead tuples with xmax before OldestXmin if
the tuple is not considered removable by GlobalVisState. For vacuum, the
GlobalVisState is initially calculated at the beginning of vacuuming the
relation -- at the same time and with the same value as
VacuumCutoffs->OldestXmin.

A backend's GlobalVisState may be updated again when it is accessed if a
new snapshot is taken or if something caused ComputeXidHorizons() to be
called. This can happen, for example, at the end of a round of index
vacuuming when GetOldestNonRemovableTransactionId() is called.

Normally this may result in GlobalVisState's horizon moving forward --
potentially allowing more dead tuples to be removed. However, if a
disconnected standby with a running transaction older than
VacuumCutoffs->OldestXmin reconnects to the primary after vacuum
initially calculates GlobalVisState and OldestXmin but before
GlobalVisState is updated, the value of GlobalVisState->maybe_needed
could go backwards.

If this happens in the middle of vacuum's first pruning and freezing
pass, it is possible that pruning/freezing could then encounter a tuple
whose xmax is younger than GlobalVisState->maybe_needed and older than
VacuumCutoffs->OldestXmin.
heap_prune_satisfies_vacuum() would deem the tuple
HEAPTUPLE_RECENTLY_DEAD and would not remove it. But
heap_pre_freeze_checks() would ERROR out with "cannot freeze committed
xmax". This check is there to avoid freezing dead tuples.

We can fix this by always removing tuples considered dead before
VacuumCutoffs->OldestXmin. This is okay even if a reconnected standby
has a transaction that sees that tuple as alive, because it will simply
wait to replay the removal until it would be correct to do so, or
recovery conflict handling will cancel the transaction that sees the
tuple as alive and allow replay to continue.

Attached is the suggested fix for master plus a repro. I wrote it as a
recovery suite TAP test, but I am _not_ proposing we add it to the
ongoing test suite. It is, amongst other things, definitely prone to
flaking. I also had to use loads of data to force two index vacuuming
passes now that we have TIDStore, so it is a slow test.

If you want to run the repro with meson, you'll have to add
't/099_vacuum_hang.pl' to src/test/recovery/meson.build and then run it
with:

    meson test postgresql:recovery / recovery/099_vacuum_hang

If you use autotools, you can run it with:

    make check PROVE_TESTS="t/099_vacuum_hang.pl"

The repro forces a round of index vacuuming after the standby reconnects
and before pruning a dead tuple whose xmax is older than OldestXmin.

At the end of the round of index vacuuming, _bt_pendingfsm_finalize()
calls GetOldestNonRemovableTransactionId(), thereby updating the
backend's GlobalVisState and moving maybe_needed backwards. Then
vacuum's first pass will continue with pruning and find our later
inserted and updated tuple HEAPTUPLE_RECENTLY_DEAD when compared to
maybe_needed but HEAPTUPLE_DEAD when compared to OldestXmin.

I make sure that the standby reconnects between vacuum_get_cutoffs() and
pruning because I have a cursor on the page keeping VACUUM FREEZE from
getting a cleanup lock.
See the repro for step-by-step explanations of how it works.

I have a modified version of this that repros the infinite loop on
14-16 with substantially less data. See it here [2]. Also, the repro
attached to this mail won't work on 14 and 15 because of changes to
background_psql.

- Melanie

[1] https://postgr.es/m/20240415173913.4zyyrwaftujxthf2%40awork3.anarazel.de#1b216b7768b5bd577a3d3d51bd5aadee
[2] https://www.postgresql.org/message-id/CAAKRu_Y_NJzF4-8gzTTeaOuUL3CcGoXPjXcAHbTTygT8AyVqag%40mail.gmail.com
Attachment
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> If vacuum fails to remove a tuple with xmax older than
> VacuumCutoffs->OldestXmin and younger than
> GlobalVisState->maybe_needed, it will ERROR out when determining
> whether or not to freeze the tuple with "cannot freeze committed
> xmax".
>
> In back branches starting with 14, failing to remove tuples older than
> OldestXmin during pruning caused vacuum to infinitely loop in
> lazy_scan_prune(), as investigated on this [1] thread.

This is a great summary.

> We can fix this by always removing tuples considered dead before
> VacuumCutoffs->OldestXmin. This is okay even if a reconnected standby
> has a transaction that sees that tuple as alive, because it will
> simply wait to replay the removal until it would be correct to do so
> or recovery conflict handling will cancel the transaction that sees
> the tuple as alive and allow replay to continue.

I think that this is the right general approach.

> The repro forces a round of index vacuuming after the standby
> reconnects and before pruning a dead tuple whose xmax is older than
> OldestXmin.
>
> At the end of the round of index vacuuming, _bt_pendingfsm_finalize()
> calls GetOldestNonRemovableTransactionId(), thereby updating the
> backend's GlobalVisState and moving maybe_needed backwards.

Right. I saw details exactly consistent with this when I used GDB
against a production instance.

I'm glad that you were able to come up with a repro that involves
exactly the same basic elements, including index page deletion.

--
Peter Geoghegan
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
Hi, Melanie!

I'm glad to hear that you have found the root cause of the problem)
Thank you for that!
> heap_prune_satisfies_vacuum() would deem the tuple
> HEAPTUPLE_RECENTLY_DEAD and would not remove it. But the
> heap_pre_freeze_checks() would ERROR out with "cannot freeze committed
> xmax". This check is to avoid freezing dead tuples.
>
> We can fix this by always removing tuples considered dead before
> VacuumCutoffs->OldestXmin. This is okay even if a reconnected standby
> has a transaction that sees that tuple as alive, because it will
> simply wait to replay the removal until it would be correct to do so
> or recovery conflict handling will cancel the transaction that sees
> the tuple as alive and allow replay to continue.
This is an interesting and difficult case)

I noticed that, in my opinion, we do excessive freezing when
initializing the cluster. Initialization takes a long time, which can
lead, for example, to longer test execution. I got rid of this by
adding a check that OldestMxact is not FirstMultiXactId, and it works
fine.
if (prstate->cutoffs &&
    TransactionIdIsValid(prstate->cutoffs->OldestXmin) &&
    prstate->cutoffs->OldestMxact != FirstMultiXactId &&
    NormalTransactionIdPrecedes(dead_after, prstate->cutoffs->OldestXmin))
    return HEAPTUPLE_DEAD;
Can I keep it?
To be honest, the meson test is new for me, but I see its useful
features. I think I will use it for checking my features)

> Attached is the suggested fix for master plus a repro. I wrote it as a
> recovery suite TAP test, but I am _not_ proposing we add it to the
> ongoing test suite. It is, amongst other things, definitely prone to
> flaking. I also had to use loads of data to force two index vacuuming
> passes now that we have TIDStore, so it is a slow test.
>
> -- snip --
>
> I have a modified version of this that repros the infinite loop on
> 14-16 with substantially less data. See it here [2]. Also, the repro
> attached to this mail won't work on 14 and 15 because of changes to
> background_psql.
>
> [1] https://postgr.es/m/20240415173913.4zyyrwaftujxthf2%40awork3.anarazel.de#1b216b7768b5bd577a3d3d51bd5aadee
> [2] https://www.postgresql.org/message-id/CAAKRu_Y_NJzF4-8gzTTeaOuUL3CcGoXPjXcAHbTTygT8AyVqag%40mail.gmail.com
I couldn't understand why the replica is necessary here. Now I am
digging into why I got similar behavior without a replica, when I have
only one instance. I'm still checking this in my test, but I believe
this patch fixes the original problem because the symptoms were the
same.
--
Regards,
Alena Rybakina
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On 21/06/2024 03:02, Peter Geoghegan wrote:
> On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
>> If vacuum fails to remove a tuple with xmax older than
>> VacuumCutoffs->OldestXmin and younger than
>> GlobalVisState->maybe_needed, it will ERROR out when determining
>> whether or not to freeze the tuple with "cannot freeze committed
>> xmax".
>>
>> In back branches starting with 14, failing to remove tuples older than
>> OldestXmin during pruning caused vacuum to infinitely loop in
>> lazy_scan_prune(), as investigated on this [1] thread.
>
> This is a great summary.

+1

>> We can fix this by always removing tuples considered dead before
>> VacuumCutoffs->OldestXmin. This is okay even if a reconnected standby
>> has a transaction that sees that tuple as alive, because it will
>> simply wait to replay the removal until it would be correct to do so
>> or recovery conflict handling will cancel the transaction that sees
>> the tuple as alive and allow replay to continue.
>
> I think that this is the right general approach.

+1

>> The repro forces a round of index vacuuming after the standby
>> reconnects and before pruning a dead tuple whose xmax is older than
>> OldestXmin.
>>
>> At the end of the round of index vacuuming, _bt_pendingfsm_finalize()
>> calls GetOldestNonRemovableTransactionId(), thereby updating the
>> backend's GlobalVisState and moving maybe_needed backwards.
>
> Right. I saw details exactly consistent with this when I used GDB
> against a production instance.
>
> I'm glad that you were able to come up with a repro that involves
> exactly the same basic elements, including index page deletion.

Would it be possible to make it robust so that we could always run it
with "make check"? This seems like an important corner case to
regression test.

--
Heikki Linnakangas
Neon (https://neon.tech)
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Mon, Jun 24, 2024 at 4:10 AM Alena Rybakina <lena.ribackina@yandex.ru> wrote:
>
> We can fix this by always removing tuples considered dead before
> VacuumCutoffs->OldestXmin. This is okay even if a reconnected standby
> has a transaction that sees that tuple as alive, because it will
> simply wait to replay the removal until it would be correct to do so
> or recovery conflict handling will cancel the transaction that sees
> the tuple as alive and allow replay to continue.
>
> This is an interesting and difficult case) I noticed that when
> initializing the cluster, in my opinion, we provide excessive
> freezing. Initialization takes a long time, which can lead, for
> example, to longer test execution. I got rid of this by adding the
> OldestMxact checkbox is not FirstMultiXactId, and it works fine.
>
> if (prstate->cutoffs &&
>     TransactionIdIsValid(prstate->cutoffs->OldestXmin) &&
>     prstate->cutoffs->OldestMxact != FirstMultiXactId &&
>     NormalTransactionIdPrecedes(dead_after, prstate->cutoffs->OldestXmin))
>     return HEAPTUPLE_DEAD;
>
> Can I keep it?

This looks like an addition to the new criteria I added to
heap_prune_satisfies_vacuum(). Is that what you are suggesting? If so,
it looks like it would only return HEAPTUPLE_DEAD (and thus only
remove) a subset of the tuples my original criteria would remove. When
vacuum calculates OldestMxact as FirstMultiXactId, it would not remove
those tuples deleted before OldestXmin. It seems like OldestMxact will
equal FirstMultiXactID sometimes right after initdb and after
transaction ID wraparound. I'm not sure I totally understand the
criteria.

One thing I find confusing about this is that this would actually
remove less tuples than with my criteria -- which could lead to more
freezing. When vacuum calculates OldestMxact == FirstMultiXactID, we
would not remove tuples deleted before OldestXmin and thus return
HEAPTUPLE_RECENTLY_DEAD for those tuples. Then we would consider
freezing them. So, it seems like we would do more freezing by adding
this criteria.

Could you explain more about how the criteria you are suggesting
works? Are you saying it does less freezing than master or less
freezing than with my patch?

> Attached is the suggested fix for master plus a repro. I wrote it as a
> recovery suite TAP test, but I am _not_ proposing we add it to the
> ongoing test suite. It is, amongst other things, definitely prone to
> flaking. I also had to use loads of data to force two index vacuuming
> passes now that we have TIDStore, so it is a slow test.

-- snip --

> I have a modified version of this that repros the infinite loop on
> 14-16 with substantially less data. See it here [2]. Also, the repro
> attached to this mail won't work on 14 and 15 because of changes to
> background_psql.
>
> I couldn't understand why the replica is necessary here. Now I am
> digging why I got the similar behavior without replica when I have
> only one instance. I'm still checking this in my test, but I believe
> this patch fixes the original problem because the symptoms were the
> same.

Did you get similar behavior on master or on back branches? Was the
behavior you observed the infinite loop or the error during
heap_prepare_freeze_tuple()?

In my examples, the replica is needed because something has to move
the horizon on the primary backwards. When a standby reconnects with
an older oldest running transaction ID than any of the running
transactions on the primary and the vacuuming backend recomputes its
RecentXmin, the horizon may move backwards when compared to the
horizon calculated at the beginning of the vacuum. Vacuum does not
recompute cutoffs->OldestXmin during vacuuming a relation but it may
recompute the values in the GlobalVisState it uses for pruning.

We knew of only one other way that the horizon could move backwards
which Matthias describes here [1]. However, this is thought to be its
own concurrency-related bug in the commit-abort path that should be
fixed -- as opposed to the standby reconnecting with an older oldest
running transaction ID which can be expected. Do you know if you were
seeing the effects of the scenario Matthias describes?

- Melanie

[1] https://www.postgresql.org/message-id/CAEze2WjMTh4KS0%3DQEQB-Jq%2BtDLPR%2B0%2BzVBMfVwSPK5A%3DWZa95Q%40mail.gmail.com
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Mon, Jun 24, 2024 at 4:27 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 21/06/2024 03:02, Peter Geoghegan wrote:
> > On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman
> > <melanieplageman@gmail.com> wrote:
> >
> >> The repro forces a round of index vacuuming after the standby
> >> reconnects and before pruning a dead tuple whose xmax is older than
> >> OldestXmin.
> >>
> >> At the end of the round of index vacuuming, _bt_pendingfsm_finalize()
> >> calls GetOldestNonRemovableTransactionId(), thereby updating the
> >> backend's GlobalVisState and moving maybe_needed backwards.
> >
> > Right. I saw details exactly consistent with this when I used GDB
> > against a production instance.
> >
> > I'm glad that you were able to come up with a repro that involves
> > exactly the same basic elements, including index page deletion.
>
> Would it be possible to make it robust so that we could always run it
> with "make check"? This seems like an important corner case to
> regression test.

I'd have to look into how to ensure I can stabilize some of the parts
that seem prone to flaking. I can probably stabilize the vacuum bit
with a query of pg_stat_activity making sure it is waiting to acquire
the cleanup lock.

I don't, however, see a good way around the large amount of data
required to trigger more than one round of index vacuuming. I could
generate the data more efficiently than I am doing here
(generate_series() in the from clause). Perhaps with a copy? I know it
is too slow now to go in an ongoing test, but I don't have an
intuition around how fast it would have to be to be acceptable. Is
there a set of additional tests that are slower that we don't always
run? I didn't follow how the wraparound test ended up, but that seems
like one that would have been slow.

- Melanie
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On 6/24/24 16:53, Melanie Plageman wrote:
> On Mon, Jun 24, 2024 at 4:27 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>
>> Would it be possible to make it robust so that we could always run it
>> with "make check"? This seems like an important corner case to
>> regression test.
>
> I'd have to look into how to ensure I can stabilize some of the parts
> that seem prone to flaking. I can probably stabilize the vacuum bit
> with a query of pg_stat_activity making sure it is waiting to acquire
> the cleanup lock.
>
> I don't, however, see a good way around the large amount of data
> required to trigger more than one round of index vacuuming. I could
> generate the data more efficiently than I am doing here
> (generate_series() in the from clause). Perhaps with a copy? I know it
> is too slow now to go in an ongoing test, but I don't have an
> intuition around how fast it would have to be to be acceptable. Is
> there a set of additional tests that are slower that we don't always
> run? I didn't follow how the wraparound test ended up, but that seems
> like one that would have been slow.

I think it depends on what is the impact on the 'make check' duration.
If it could be added to one of the existing test groups, then it
depends on how long the slowest test in that group is. If the new test
needs to be in a separate group, it probably needs to be very fast.

But I was wondering how much time are we talking about, so I tried

creating a table, filling it with 300k rows => 250ms
creating an index => 180ms
delete 90% data => 200ms
vacuum t => 130ms

which with m_w_m=1MB does two rounds of index cleanup. That's ~760ms.
I'm not sure how much more stuff does the test need to do, but this
would be pretty reasonable, if we could add it to an existing group.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> We can fix this by always removing tuples considered dead before
> VacuumCutoffs->OldestXmin.

I don't have a great feeling about this fix. It's not that I think
it's wrong. It's just that the underlying problem here is that we have
heap_page_prune_and_freeze() getting both GlobalVisState *vistest and
struct VacuumCutoffs *cutoffs, and the vistest wants to be in charge
of deciding what gets pruned, but that doesn't actually work, because
as I pointed out in
http://postgr.es/m/CA+Tgmob1BtWcP6R5-toVHB5wqHasPTSR2TJkcDCutMzaUYBaHQ@mail.gmail.com
it's not properly synchronized with vacrel->cutoffs.OldestXmin. Your
fix is to consider both variables, which again may be totally correct,
but wouldn't it be a lot better if we didn't have two variables
fighting for control of the same behavior?

(I'm not trying to be a nuisance here -- I think it's great that
you've done the work to pin this down and perhaps there is no better
fix than what you've proposed.)

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Mon, Jun 24, 2024 at 11:44 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I don't have a great feeling about this fix. It's not that I think
> it's wrong. It's just that the underlying problem here is that we have
> heap_page_prune_and_freeze() getting both GlobalVisState *vistest and
> struct VacuumCutoffs *cutoffs, and the vistest wants to be in charge
> of deciding what gets pruned, but that doesn't actually work, because
> as I pointed out in
> http://postgr.es/m/CA+Tgmob1BtWcP6R5-toVHB5wqHasPTSR2TJkcDCutMzaUYBaHQ@mail.gmail.com
> it's not properly synchronized with vacrel->cutoffs.OldestXmin. Your
> fix is to consider both variables, which again may be totally correct,
> but wouldn't it be a lot better if we didn't have two variables
> fighting for control of the same behavior?

Why would it be better? It's to our advantage to have vistest prune
away extra tuples when possible. Andres placed a lot of emphasis on
that during the snapshot scalability work -- vistest can be updated on
the fly.

The problem here is that OldestXmin is supposed to be more
conservative than vistest, which it almost always is, except in this
one edge case. I don't think that plugging that hole changes the basic
fact that there is one source of truth about what *needs* to be
pruned. There is such a source of truth: OldestXmin.

--
Peter Geoghegan
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Mon, Jun 24, 2024 at 12:43 PM Peter Geoghegan <pg@bowt.ie> wrote:
> The problem here is that OldestXmin is supposed to be more
> conservative than vistest, which it almost always is, except in this
> one edge case. I don't think that plugging that hole changes the basic
> fact that there is one source of truth about what *needs* to be
> pruned. There is such a source of truth: OldestXmin.

Well, another approach could be to make it so that OldestXmin actually
is always more conservative than vistest rather than almost always.

I agree with you that letting the pruning horizon move forward during
vacuum is desirable. I'm just wondering if having the vacuum code need
to know a second horizon is really the best way to address that.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Mon, Jun 24, 2024 at 1:05 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jun 24, 2024 at 12:43 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > The problem here is that OldestXmin is supposed to be more
> > conservative than vistest, which it almost always is, except in this
> > one edge case. I don't think that plugging that hole changes the basic
> > fact that there is one source of truth about what *needs* to be
> > pruned. There is such a source of truth: OldestXmin.
>
> Well, another approach could be to make it so that OldestXmin actually
> is always more conservative than vistest rather than almost always.

If we did things like that then it would still be necessary to write a
patch like the one Melanie came up with, on the grounds that we'd
really need to be paranoid about having missed some subtlety. We might
as well just rely on the mechanism directly. I just don't think that
it makes much difference.

--
Peter Geoghegan
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Mon, Jun 24, 2024 at 1:05 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Jun 24, 2024 at 12:43 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > The problem here is that OldestXmin is supposed to be more
> > conservative than vistest, which it almost always is, except in this
> > one edge case. I don't think that plugging that hole changes the basic
> > fact that there is one source of truth about what *needs* to be
> > pruned. There is such a source of truth: OldestXmin.
>
> Well, another approach could be to make it so that OldestXmin actually
> is always more conservative than vistest rather than almost always.

For the purposes of pruning, we are effectively always using the more
conservative of the two with this patch.

Are you more concerned about having a single horizon for pruning or
about having a horizon that does not move backwards after being
established at the beginning of vacuuming the relation?

Right now, in master, we do use a single horizon when determining what
is pruned -- that from GlobalVisState. OldestXmin is only used for
freezing and full page visibility determinations. Using a different
horizon for pruning by vacuum than freezing is what is causing the
error on master.

> I agree with you that letting the pruning horizon move forward during
> vacuum is desirable. I'm just wondering if having the vacuum code need
> to know a second horizon is really the best way to address that.

I was thinking about this some more and I realized I don't really get
why we think using GlobalVisState for pruning will let us remove more
tuples in the common case.

I had always thought it was because the vacuuming backend's
GlobalVisState will get updated periodically throughout vacuum and so,
assuming the oldest running transaction changes, our horizon for
vacuum would change. But, in writing this repro, it is actually quite
hard to get GlobalVisState to update. Our backend's RecentXmin needs
to have changed. And there aren't very many places where we take a new
snapshot after starting to vacuum a relation. One of those is at the
end of index vacuuming, but that can only affect the pruning horizon
if we have to do multiple rounds of index vacuuming. Is that really
the case we are thinking of when we say we want the pruning horizon to
move forward during vacuum?

- Melanie
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Mon, Jun 24, 2024 at 3:23 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> I had always thought it was because the vacuuming backend's
> GlobalVisState will get updated periodically throughout vacuum and so,
> assuming the oldest running transaction changes, our horizon for
> vacuum would change.

I believe that it's more of an aspirational thing at this point. That
it is currently aspirational (it happens to some extent but isn't ever
particularly useful) shouldn't change the analysis about how to fix
this bug.

> One of those is at the
> end of index vacuuming, but that can only affect the pruning horizon
> if we have to do multiple rounds of index vacuuming. Is that really
> the case we are thinking of when we say we want the pruning horizon to
> move forward during vacuum?

No, that's definitely not what we were thinking of. It's just an
accident that it's almost the only thing that'll do that.

--
Peter Geoghegan
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> If vacuum fails to remove a tuple with xmax older than
> VacuumCutoffs->OldestXmin and younger than
> GlobalVisState->maybe_needed, it will ERROR out when determining
> whether or not to freeze the tuple with "cannot freeze committed
> xmax".

One thing I don't understand is why it is okay to freeze the xmax of a
dead tuple just because it is from an aborted update.
heap_prepare_freeze_tuple() is called on HEAPTUPLE_RECENTLY_DEAD
tuples with normal xmaxes (non-multis) so that it can freeze tuples
from aborted updates. The only case in which we freeze dead tuples
with a non-multi xmax is if the xmax is from before OldestXmin and is
also not committed (so from an aborted update).

Freezing dead tuples replaces their xmax with InvalidTransactionId --
which would make them look alive. So, it makes sense we don't do this
for dead tuples in the common case. But why is it 1) okay and 2)
desirable to freeze xmaxes of tuples from aborted updates? Won't it
make them look alive again?

- Melanie
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Mon, Jun 24, 2024 at 3:23 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> Are you more concerned about having a single horizon for pruning or
> about having a horizon that does not move backwards after being
> established at the beginning of vacuuming the relation?

I'm not sure I understand. The most important thing here is fixing the
bug. But if we have a choice of how to fix the bug, I'd prefer to do
it by having the pruning code test one horizon that is always correct,
rather than (as I think the patch does) having it test against two
horizons as a way of covering possible discrepancies between those
values.

> Right now, in master, we do use a single horizon when determining what
> is pruned -- that from GlobalVisState. OldestXmin is only used for
> freezing and full page visibility determinations. Using a different
> horizon for pruning by vacuum than freezing is what is causing the
> error on master.

Agreed.

> I had always thought it was because the vacuuming backend's
> GlobalVisState will get updated periodically throughout vacuum and so,
> assuming the oldest running transaction changes, our horizon for
> vacuum would change. But, in writing this repro, it is actually quite
> hard to get GlobalVisState to update. Our backend's RecentXmin needs
> to have changed. And there aren't very many places where we take a new
> snapshot after starting to vacuum a relation. One of those is at the
> end of index vacuuming, but that can only affect the pruning horizon
> if we have to do multiple rounds of index vacuuming. Is that really
> the case we are thinking of when we say we want the pruning horizon to
> move forward during vacuum?

I thought the idea was that the GlobalVisTest stuff would force a
recalculation now and then, but maybe it doesn't actually do that?

Suppose process A begins a transaction, acquires an XID, and then goes
idle. Process B now begins a giant vacuum. At some point in the middle
of the vacuum, A ends the transaction. Are you saying that B's
GlobalVisTest never really notices that this has happened?

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Mon, Jun 24, 2024 at 3:36 PM Melanie Plageman <melanieplageman@gmail.com> wrote:
> One thing I don't understand is why it is okay to freeze the xmax of a
> dead tuple just because it is from an aborted update.

We don't do that with XID-based xmaxs. Though perhaps we should, since we'll already prune away the successor tuple, and so might as well go one tiny step further and clear the xmax for the original tuple via freezing/setting it to InvalidTransactionId. Instead we just leave the original tuple largely undisturbed, with its original xmax.

We do something like that with Multi-based xmax fields, though not with the specific goal of cleaning up after aborts in mind (we can also remove lockers that are no longer running, regardless of where they are relative to OldestXmin, stuff like that). The actual goal with that is to enforce MultiXactCutoff, independently of whether or not their member XIDs are < FreezeLimit yet.

> The only case in which we freeze dead tuples
> with a non-multi xmax is if the xmax is from before OldestXmin and is
> also not committed (so from an aborted update).

Perhaps I misunderstand, but: we simply don't freeze DEAD (not RECENTLY_DEAD) tuples in the first place, because we don't have to (pruning removes them instead). It doesn't matter if they're DEAD due to being from aborted transactions or DEAD due to being deleted/updated by a transaction that committed (committed and < OldestXmin). The freezing-related code paths in heapam.c don't particularly care whether a tuple xmax is RECENTLY_DEAD or LIVE to HTSV + OldestXmin. Just as long as it's not fully DEAD (then it should have been pruned).

--
Peter Geoghegan
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Mon, Jun 24, 2024 at 4:36 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I'm not sure I understand. The most important thing here is fixing the
> bug. But if we have a choice of how to fix the bug, I'd prefer to do
> it by having the pruning code test one horizon that is always correct,
> rather than (as I think the patch does) having it test against two
> horizons as a way of covering possible discrepancies between
> those values.

Your characterization of OldestXmin + vistest as two horizons seems pretty arbitrary to me. I know what you mean, of course, but it seems like a distinction without a difference.

> I thought the idea was that the GlobalVisTest stuff would force a
> recalculation now and then, but maybe it doesn't actually do that?

It definitely can do that. Just not in a way that meaningfully increases the number of heap tuples that we can recognize as DEAD and remove. At least not currently.

> Suppose process A begins a transaction, acquires an XID, and then goes
> idle. Process B now begins a giant vacuum. At some point in the middle
> of the vacuum, A ends the transaction. Are you saying that B's
> GlobalVisTest never really notices that this has happened?

That's my understanding, yes. That is, vistest is approximately the same thing as OldestXmin anyway. At least for now.

--
Peter Geoghegan
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Mon, Jun 24, 2024 at 4:42 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Jun 24, 2024 at 3:36 PM Melanie Plageman <melanieplageman@gmail.com> wrote:
> > One thing I don't understand is why it is okay to freeze the xmax of a
> > dead tuple just because it is from an aborted update.
>
> We don't do that with XID-based xmaxs. Though perhaps we should, since
> we'll already prune away the successor tuple, and so might as well go
> one tiny step further and clear the xmax for the original tuple via
> freezing/setting it to InvalidTransactionId. Instead we just leave the
> original tuple largely undisturbed, with its original xmax.

I thought that was the case too, but we call heap_prepare_freeze_tuple() on HEAPTUPLE_RECENTLY_DEAD tuples and then

    else if (TransactionIdIsNormal(xid))
    {
        /* Raw xmax is normal XID */
        freeze_xmax = TransactionIdPrecedes(xid, cutoffs->OldestXmin);
    }

And then later we

    if (freeze_xmax)
        frz->xmax = InvalidTransactionId;

and then when we execute freezing the tuple in heap_execute_freeze_tuple()

    HeapTupleHeaderSetXmax(tuple, frz->xmax);

which sets the xmax to InvalidTransactionId. Or am I missing something?

> > The only case in which we freeze dead tuples
> > with a non-multi xmax is if the xmax is from before OldestXmin and is
> > also not committed (so from an aborted update).
>
> Perhaps I misunderstand, but: we simply don't freeze DEAD (not
> RECENTLY_DEAD) tuples in the first place, because we don't have to
> (pruning removes them instead). It doesn't matter if they're DEAD due
> to being from aborted transactions or DEAD due to being
> deleted/updated by a transaction that committed (committed and <
> OldestXmin).

Right, I'm talking about HEAPTUPLE_RECENTLY_DEAD tuples. HEAPTUPLE_DEAD tuples are pruned away. But we can't replace the xmax of a tuple that has been deleted or updated by a transaction that committed with InvalidTransactionId. And it seems like the code does that?
Why even call heap_prepare_freeze_tuple() on HEAPTUPLE_RECENTLY_DEAD tuples? Is it mainly to handle MultiXact freezing? > The freezing related code paths in heapam.c don't particularly care > whether a tuple xmax is RECENTLY_DEAD or LIVE to HTSV + OldestXmin. > Just as long as it's not fully DEAD (then it should have been pruned). But it just seems like we shouldn't freeze RECENTLY_DEAD either. - Melanie
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Mon, Jun 24, 2024 at 4:51 PM Peter Geoghegan <pg@bowt.ie> wrote: > > On Mon, Jun 24, 2024 at 4:36 PM Robert Haas <robertmhaas@gmail.com> wrote: > > I thought the idea was that the GlobalVisTest stuff would force a > > recalculation now and then, but maybe it doesn't actually do that? > > It definitely can do that. Just not in a way that meaningfully > increases the number of heap tuples that we can recognize as DEAD and > remove. At least not currently. > > > Suppose process A begins a transaction, acquires an XID, and then goes > > idle. Process B now begins a giant vacuum. At some point in the middle > > of the vacuum, A ends the transaction. Are you saying that B's > > GlobalVisTest never really notices that this has happened? > > That's my understanding, yes. That is, vistest is approximately the > same thing as OldestXmin anyway. At least for now. Exactly. Something has to cause this backend to update its view of the horizon. At the end of index vacuuming, GetOldestNonRemovableTransactionId() will explicitly ComputeXidHorizons() which will update our backend's GlobalVisStates. Otherwise, if our backend's RecentXmin is updated, by taking a new snapshot, then we may update our GlobalVisStates. See GlobalVisTestShouldUpdate() for the conditions under which we would update our GlobalVisStates during the normal visibility checks happening during pruning. Vacuum used to open indexes after calculating horizons before starting its first pass. This led to a recomputation of the horizon. But, in master, there aren't many obvious places where such a thing would be happening. - Melanie
On Mon, Jun 24, 2024 at 04:51:24PM -0400, Peter Geoghegan wrote:
> On Mon, Jun 24, 2024 at 4:36 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > I'm not sure I understand. The most important thing here is fixing the
> > bug. But if we have a choice of how to fix the bug, I'd prefer to do
> > it by having the pruning code test one horizon that is always correct,
> > rather than (as I think the patch does) having it test against two
> > horizons as a way of covering possible discrepancies between
> > those values.
>
> Your characterization of OldestXmin + vistest as two horizons seems
> pretty arbitrary to me. I know what you mean, of course, but it seems
> like a distinction without a difference.

"Two horizons" matches how I model it. If the two were _always_ indicating the same notion of visibility, we wouldn't have this thread.

On Mon, Jun 24, 2024 at 03:23:39PM -0400, Melanie Plageman wrote:
> Right now, in master, we do use a single horizon when determining what
> is pruned -- that from GlobalVisState. OldestXmin is only used for
> freezing and full page visibility determinations. Using a different
> horizon for pruning by vacuum than freezing is what is causing the
> error on master.

Agreed, and I think using different sources for pruning and freezing is a recipe for future bugs. Fundamentally, both are about answering "is snapshot_considers_xid_in_progress(snapshot, xid) false for every snapshot?" That's not to say this thread shall unify the two, but I suspect that's the right long-term direction.
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Mon, Jun 24, 2024 at 9:30 PM Noah Misch <noah@leadboat.com> wrote: > On Mon, Jun 24, 2024 at 03:23:39PM -0400, Melanie Plageman wrote: > > Right now, in master, we do use a single horizon when determining what > > is pruned -- that from GlobalVisState. OldestXmin is only used for > > freezing and full page visibility determinations. Using a different > > horizon for pruning by vacuum than freezing is what is causing the > > error on master. > > Agreed, and I think using different sources for pruning and freezing is a > recipe for future bugs. Fundamentally, both are about answering "is > snapshot_considers_xid_in_progress(snapshot, xid) false for every snapshot?" > That's not to say this thread shall unify the two, but I suspect that's the > right long-term direction. What does it really mean to unify the two, though? If the OldestXmin field was located in struct GlobalVisState (next to definitely_needed and maybe_needed), but everything worked in essentially the same way as it will with Melanie's patch in place, would that count as unifying the two? Why or why not? -- Peter Geoghegan
On Mon, Jun 24, 2024 at 09:49:53PM -0400, Peter Geoghegan wrote: > On Mon, Jun 24, 2024 at 9:30 PM Noah Misch <noah@leadboat.com> wrote: > > On Mon, Jun 24, 2024 at 03:23:39PM -0400, Melanie Plageman wrote: > > > Right now, in master, we do use a single horizon when determining what > > > is pruned -- that from GlobalVisState. OldestXmin is only used for > > > freezing and full page visibility determinations. Using a different > > > horizon for pruning by vacuum than freezing is what is causing the > > > error on master. > > > > Agreed, and I think using different sources for pruning and freezing is a > > recipe for future bugs. Fundamentally, both are about answering "is > > snapshot_considers_xid_in_progress(snapshot, xid) false for every snapshot?" > > That's not to say this thread shall unify the two, but I suspect that's the > > right long-term direction. > > What does it really mean to unify the two, though? > > If the OldestXmin field was located in struct GlobalVisState (next to > definitely_needed and maybe_needed), but everything worked in > essentially the same way as it will with Melanie's patch in place, > would that count as unifying the two? Why or why not? To me, no, unification would mean removing the data redundancy. Relocating the redundant field and/or code that updates the redundant field certainly could reduce the risk of bugs, so I'm not opposing everything short of removing the data redundancy. I'm just agreeing with the "prefer" from https://postgr.es/m/CA+TgmoYzS_bkt_MrNxr5QrXDKfedmh4tStn8UBTTBXqv=3JTew@mail.gmail.com
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
Hi,

On 2024-06-24 16:35:50 -0400, Robert Haas wrote:
> On Mon, Jun 24, 2024 at 3:23 PM Melanie Plageman <melanieplageman@gmail.com> wrote:
> > Are you more concerned about having a single horizon for pruning or
> > about having a horizon that does not move backwards after being
> > established at the beginning of vacuuming the relation?
>
> I'm not sure I understand. The most important thing here is fixing the
> bug. But if we have a choice of how to fix the bug, I'd prefer to do
> it by having the pruning code test one horizon that is always correct,
> rather than (as I think the patch does) having it test against two
> horizons as a way of covering possible discrepancies between
> those values.

I think that's going in the wrong direction. We *want* to prune more aggressively if we can (*), and the necessary state is represented by the vistest. That's a different thing than *having* to prune tuples beyond a certain xmin (the cutoff determined by vacuum.c/vacuumlazy.c). The problem we're having here is that the two states can get out of sync due to the vistest "moving backwards", because of hot_standby_feedback (and perhaps also an issue around aborts). To prevent that we can a) make sure that we always take the hard cutoff into account, or b) prevent vistest from going backwards.

(*) we really ought to become more aggressive, by removing intermediary row versions when they're not visible to anyone, but not yet old enough to be below the horizon. But that realistically will only be possible in *some* cases, e.g. when the predecessor row version is on the same page.

> > I had always thought it was because the vacuuming backend's
> > GlobalVisState will get updated periodically throughout vacuum and so,
> > assuming the oldest running transaction changes, our horizon for
> > vacuum would change. But, in writing this repro, it is actually quite
> > hard to get GlobalVisState to update. Our backend's RecentXmin needs
> > to have changed. And there aren't very many places where we take a new
> > snapshot after starting to vacuum a relation. One of those is at the
> > end of index vacuuming, but that can only affect the pruning horizon
> > if we have to do multiple rounds of index vacuuming. Is that really
> > the case we are thinking of when we say we want the pruning horizon to
> > move forward during vacuum?
>
> I thought the idea was that the GlobalVisTest stuff would force a
> recalculation now and then, but maybe it doesn't actually do that?

It forces an accurate horizon to be determined the first time it would require it to determine visibility. The "first time" is determined by RecentXmin not having changed.

The main goal of the vistest stuff was to not have to determine an accurate horizon in GetSnapshotData(). Determining an accurate horizon requires accessing each backend's ->xmin, which causes things to scale badly, as ->xmin changes so frequently. The cost of determining the accurate horizon is irrelevant for vacuums, but it's not at all irrelevant for on-access pruning.

> Suppose process A begins a transaction, acquires an XID, and then goes
> idle. Process B now begins a giant vacuum. At some point in the middle
> of the vacuum, A ends the transaction. Are you saying that B's
> GlobalVisTest never really notices that this has happened?

Not at the moment, but we should add heuristics like that.

Greetings,

Andres Freund
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Tue, Jun 25, 2024 at 8:03 AM Andres Freund <andres@anarazel.de> wrote: > I think that's going in the wrong direction. We *want* to prune more > aggressively if we can (*), the necessary state is represented by the > vistest. That's a different thing than *having* to prune tuples beyond a > certain xmin (the cutoff determined by vacuum.c/vacuumlazy.c). The problem > we're having here is that the two states can get out of sync due to the > vistest "moving backwards", because of hot_standby_feedback (and perhaps also > an issue around aborts). I agree that we want to prune more aggressively if we can. I think that fixing this by preventing vistest from going backward is reasonable, and I like it better than what Melanie proposed, although I like what Melanie proposed much better than not fixing it! I'm not sure how to do that cleanly, but one of you may have an idea. I do think that having a bunch of different XID values that function as horizons and a vistest object that holds some more XID horizons floating around in vacuum makes the code hard to understand. The relationships between the various values are not well-documented. For instance, the vistest has to be after vacrel->cutoffs.OldestXmin for correctness, but I don't think there's a single comment anywhere saying that; meanwhile, the comments for VacuumCutoffs say "OldestXmin is the Xid below which tuples deleted by any xact (that committed) should be considered DEAD, not just RECENTLY_DEAD." Surely the reader can be forgiven for thinking that this is the cutoff that will actually be used by pruning, but it isn't. And more generally, it seems like a fairly big problem to me that LVRelState directly stores NewRelfrozenXid; contains a VacuumCutoffs object that stores relfrozenxid, OldestXmin, and FreezeLimit; and also points to a GlobalVisState object that contains definitely_needed and maybe_needed. That is six different XID cutoffs for one vacuum operation. That's a lot. 
I can't describe how they're all different from each other or what the necessary relationships between them are off-hand, and I bet nobody else could either, at least until recently, else we might not have this bug. I feel like if it were possible to have fewer of them and still have things work, we'd be better off. I'm not sure that's doable. But six seems like a lot. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On 2024-06-25 08:42:02 -0400, Robert Haas wrote:
> On Tue, Jun 25, 2024 at 8:03 AM Andres Freund <andres@anarazel.de> wrote:
> > I think that's going in the wrong direction. We *want* to prune more
> > aggressively if we can (*), the necessary state is represented by the
> > vistest. That's a different thing than *having* to prune tuples beyond a
> > certain xmin (the cutoff determined by vacuum.c/vacuumlazy.c). The problem
> > we're having here is that the two states can get out of sync due to the
> > vistest "moving backwards", because of hot_standby_feedback (and perhaps also
> > an issue around aborts).
>
> I agree that we want to prune more aggressively if we can. I think
> that fixing this by preventing vistest from going backward is
> reasonable, and I like it better than what Melanie proposed, although
> I like what Melanie proposed much better than not fixing it! I'm not
> sure how to do that cleanly, but one of you may have an idea.

It's not hard - but it has downsides. It'll mean that - outside of vacuum - we'll much more often not react to horizons going backwards due to hot_standby_feedback. Which means that hot_standby_feedback, when used without slots, will prevent fewer conflicts.

> I do think that having a bunch of different XID values that function
> as horizons and a vistest object that holds some more XID horizons
> floating around in vacuum makes the code hard to understand. The
> relationships between the various values are not well-documented. For
> instance, the vistest has to be after vacrel->cutoffs.OldestXmin for
> correctness, but I don't think there's a single comment anywhere
> saying that;

It is somewhat documented:

 * Note: the approximate horizons (see definition of GlobalVisState) are
 * updated by the computations done here. That's currently required for
 * correctness and a small optimization. Without doing so it's possible that
 * heap vacuum's call to heap_page_prune_and_freeze() uses a more conservative
 * horizon than later when deciding which tuples can be removed - which the
 * code doesn't expect (breaking HOT).

> And more generally, it seems like a fairly big problem to me that
> LVRelState directly stores NewRelfrozenXid; contains a VacuumCutoffs
> object that stores relfrozenxid, OldestXmin, and FreezeLimit; and also
> points to a GlobalVisState object that contains definitely_needed and
> maybe_needed. That is six different XID cutoffs for one vacuum
> operation. That's a lot. I can't describe how they're all different
> from each other or what the necessary relationships between them are
> off-hand, and I bet nobody else could either, at least until recently,
> else we might not have this bug. I feel like if it were possible to
> have fewer of them and still have things work, we'd be better off. I'm
> not sure that's doable. But six seems like a lot.

Agreed. I don't think you can just unify things though, they actually are all different for good, or at least decent, reasons. I think improving the naming alone could help a good bit though.

Greetings,

Andres Freund
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Tue, Jun 25, 2024 at 9:07 AM Andres Freund <andres@anarazel.de> wrote: > It's not hard - but it has downsides. It'll mean that - outside of vacuum - > we'll much more often not react to horizons going backwards due to > hot_standby_feedback. Which means that hot_standby_feedback, when used without > slots, will prevent fewer conflicts. Can you explain this in more detail? -- Robert Haas EDB: http://www.enterprisedb.com
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Mon, Jun 24, 2024 at 4:10 AM Alena Rybakina <lena.ribackina@yandex.ru> wrote:

We can fix this by always removing tuples considered dead before VacuumCutoffs->OldestXmin. This is okay even if a reconnected standby has a transaction that sees that tuple as alive, because it will simply wait to replay the removal until it would be correct to do so or recovery conflict handling will cancel the transaction that sees the tuple as alive and allow replay to continue.

This is an interesting and difficult case) I noticed that when initializing the cluster, in my opinion, we provide excessive freezing. Initialization takes a long time, which can lead, for example, to longer test execution. I got rid of this by adding a check that OldestMxact is not FirstMultiXactId, and it works fine.

    if (prstate->cutoffs &&
        TransactionIdIsValid(prstate->cutoffs->OldestXmin) &&
        prstate->cutoffs->OldestMxact != FirstMultiXactId &&
        NormalTransactionIdPrecedes(dead_after, prstate->cutoffs->OldestXmin))
        return HEAPTUPLE_DEAD;

Can I keep it?

This looks like an addition to the new criteria I added to heap_prune_satisfies_vacuum(). Is that what you are suggesting? If so, it looks like it would only return HEAPTUPLE_DEAD (and thus only remove) a subset of the tuples my original criteria would remove. When vacuum calculates OldestMxact as FirstMultiXactId, it would not remove those tuples deleted before OldestXmin. It seems like OldestMxact will equal FirstMultiXactID sometimes right after initdb and after transaction ID wraparound. I'm not sure I totally understand the criteria.

One thing I find confusing about this is that this would actually remove fewer tuples than with my criteria -- which could lead to more freezing. When vacuum calculates OldestMxact == FirstMultiXactID, we would not remove tuples deleted before OldestXmin and thus return HEAPTUPLE_RECENTLY_DEAD for those tuples. Then we would consider freezing them. So, it seems like we would do more freezing by adding this criteria.
Could you explain more about how the criteria you are suggesting works? Are you saying it does less freezing than master or less freezing than with my patch?
At first, I noticed that with this patch, vacuum fouls the nodes more often than before, and it seemed to me that more time was spent initializing the cluster with this patch than before, so I suggested considering this condition. After checking again, I found that the problem was with my laptop. So, sorry for the noise.
I'm sorry, I need a little more time to figure this out. I will answer this question later.

Attached is the suggested fix for master plus a repro. I wrote it as a recovery suite TAP test, but I am _not_ proposing we add it to the ongoing test suite. It is, amongst other things, definitely prone to flaking. I also had to use loads of data to force two index vacuuming passes now that we have TIDStore, so it is a slow test.

-- snip --

I have a modified version of this that repros the infinite loop on 14-16 with substantially less data. See it here [2]. Also, the repro attached to this mail won't work on 14 and 15 because of changes to background_psql.

I couldn't understand why the replica is necessary here. Now I am digging why I got similar behavior without a replica when I have only one instance. I'm still checking this in my test, but I believe this patch fixes the original problem because the symptoms were the same.

Did you get similar behavior on master or on back branches? Was the behavior you observed the infinite loop or the error during heap_prepare_freeze_tuple()?

In my examples, the replica is needed because something has to move the horizon on the primary backwards. When a standby reconnects with an older oldest running transaction ID than any of the running transactions on the primary and the vacuuming backend recomputes its RecentXmin, the horizon may move backwards when compared to the horizon calculated at the beginning of the vacuum. Vacuum does not recompute cutoffs->OldestXmin during vacuuming a relation but it may recompute the values in the GlobalVisState it uses for pruning.

We knew of only one other way that the horizon could move backwards which Matthias describes here [1]. However, this is thought to be its own concurrency-related bug in the commit-abort path that should be fixed -- as opposed to the standby reconnecting with an older oldest running transaction ID which can be expected.
Do you know if you were seeing the effects of the scenario Matthias describes? [1] https://www.postgresql.org/message-id/CAEze2WjMTh4KS0%3DQEQB-Jq%2BtDLPR%2B0%2BzVBMfVwSPK5A%3DWZa95Q%40mail.gmail.com
-- Regards, Alena Rybakina Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Tue, Jun 25, 2024 at 10:31 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Jun 25, 2024 at 9:07 AM Andres Freund <andres@anarazel.de> wrote: > > It's not hard - but it has downsides. It'll mean that - outside of vacuum - > > we'll much more often not react to horizons going backwards due to > > hot_standby_feedback. Which means that hot_standby_feedback, when used without > > slots, will prevent fewer conflicts. > > Can you explain this in more detail? If we prevent GlobalVisState from moving backward, then we would less frequently be pushing the horizon backward on the primary in response to hot standby feedback. Then, the primary would do more things that would not be safely replayable on the standby -- so the standby could end up encountering more recovery conflicts. - Melanie
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Tue, Jun 25, 2024 at 11:39 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > On Tue, Jun 25, 2024 at 10:31 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Jun 25, 2024 at 9:07 AM Andres Freund <andres@anarazel.de> wrote: > > > It's not hard - but it has downsides. It'll mean that - outside of vacuum - > > > we'll much more often not react to horizons going backwards due to > > > hot_standby_feedback. Which means that hot_standby_feedback, when used without > > > slots, will prevent fewer conflicts. > > > > Can you explain this in more detail? > > If we prevent GlobalVisState from moving backward, then we would less > frequently be pushing the horizon backward on the primary in response > to hot standby feedback. Then, the primary would do more things that > would not be safely replayable on the standby -- so the standby could > end up encountering more recovery conflicts. I don't get it. hot_standby_feedback only moves horizons backward on the primary, AFAIK, when it first connects, or when it reconnects. Which I guess could be frequent for some users with flaky networks, but does that really rise to the level of "much more often"? -- Robert Haas EDB: http://www.enterprisedb.com
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
Hi, On 2024-06-25 12:31:11 -0400, Robert Haas wrote: > On Tue, Jun 25, 2024 at 11:39 AM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > On Tue, Jun 25, 2024 at 10:31 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > On Tue, Jun 25, 2024 at 9:07 AM Andres Freund <andres@anarazel.de> wrote: > > > > It's not hard - but it has downsides. It'll mean that - outside of vacuum - > > > > we'll much more often not react to horizons going backwards due to > > > > hot_standby_feedback. Which means that hot_standby_feedback, when used without > > > > slots, will prevent fewer conflicts. > > > > > > Can you explain this in more detail? > > > > If we prevent GlobalVisState from moving backward, then we would less > > frequently be pushing the horizon backward on the primary in response > > to hot standby feedback. Then, the primary would do more things that > > would not be safely replayable on the standby -- so the standby could > > end up encountering more recovery conflicts. > > I don't get it. hot_standby_feedback only moves horizons backward on > the primary, AFAIK, when it first connects, or when it reconnects. > Which I guess could be frequent for some users with flaky networks, > but does that really rise to the level of "much more often"? Well, the thing is that with the "prevent it from going backwards" approach, once the horizon is set to something recent in a backend, it's "sticky". If a replica is a bit behind or if there's a long-lived snapshot and disconnects, the vistest state will advance beyond where the replica needs it to be. Even if the standby later reconnects, the vistest in long-lived sessions will still have the more advanced state. So all future pruning these backends do runs into the risk of performing pruning that removes rows the standby still deems visible and thus causes recovery conflicts. I.e. you don't even need frequent disconnects, you just need one disconnect and sessions that aren't shortlived. 
That said, obviously there will be plenty setups where this won't cause an issue. I don't really have a handle on how often it'd be a problem. Greetings, Andres Freund
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Tue, Jun 25, 2024 at 1:10 PM Andres Freund <andres@anarazel.de> wrote: > That said, obviously there will be plenty setups where this won't cause an > issue. I don't really have a handle on how often it'd be a problem. Fair enough. Even if it's not super-common, it doesn't seem like a great idea to regress such scenarios in the back-branches. Is there any way that we could instead tweak things so that we adjust the visibility test object itself? Like can have a GlobalVisTest API where we can supply the OldestXmin from the VacuumCutoffs and have it ... do something useful with that? -- Robert Haas EDB: http://www.enterprisedb.com
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On 2024-06-25 14:35:00 -0400, Robert Haas wrote: > Is there any way that we could instead tweak things so that we adjust > the visibility test object itself? Like can have a GlobalVisTest API > where we can supply the OldestXmin from the VacuumCutoffs and have it > ... do something useful with that? I doubt that's doable in the back branches. And even on HEAD, I don't think it's a particularly attractive option - there's just a global vistest for each of the types of objects with a specific horizon (they need to be updated occasionally, e.g. when taking snapshots). So there's not really a spot to put an associated OldestXmin. We could put it there and remove it at the end of vacuum / in an exception handler, but that seems substantially worse.
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Tue, Jun 25, 2024 at 4:41 PM Andres Freund <andres@anarazel.de> wrote: > I doubt that's doable in the back branches. And even on HEAD, I don't think > it's a particularly attractive option - there's just a global vistest for each > of the types of objects with a specific horizon (they need to be updated > occasionally, e.g. when taking snapshots). So there's not really a spot to put > an associated OldestXmin. We could put it there and remove it at the end of > vacuum / in an exception handler, but that seems substantially worse. Oh, right: I forgot that the visibility test objects were just pointers to global variables. Well, I don't know. I guess that doesn't leave any real options but to fix it as Melanie proposed. But I still don't like it very much. I feel like having to test against two different thresholds in the pruning code is surely a sign that we're doing something wrong. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Mon, Jun 24, 2024 at 4:27 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > Would it be possible to make it robust so that we could always run it > with "make check"? This seems like an important corner case to > regression test. Okay, I've attached a new version of the patch and a new version of the repro that may be fast and stable enough to commit. It is more minimal than the previous version. I made the table as small as I could to still trigger two rounds of index vacuuming. I tried to make it as stable as possible. I also removed the cursor on the standby that could trigger recovery conflicts. It would be super helpful if someone could take a look at the test and point out any ways I could make it even more likely to be stable. - Melanie
Attachment
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Tue, Jul 2, 2024 at 7:07 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Mon, Jun 24, 2024 at 4:27 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > > > Would it be possible to make it robust so that we could always run it > > with "make check"? This seems like an important corner case to > > regression test. > > Okay, I've attached a new version of the patch and a new version of > the repro that may be fast and stable enough to commit. It is more > minimal than the previous version. I made the table as small as I > could to still trigger two rounds of index vacuuming. I tried to make > it as stable as possible. I also removed the cursor on the standby > that could trigger recovery conflicts. It would be super helpful if > someone could take a look at the test and point out any ways I could > make it even more likely to be stable. Attached v3 has one important additional component in the test -- I use pg_stat_progress_vacuum to confirm that we actually do more than one pass of index vacuuming. Otherwise, it would have been trivial for the test to incorrectly pass. I could still use another pair of eyes on the test (looking out for stability enhancing measures I could take). I also would be happy if someone looked at the commit message and/or comments to let me know if they make sense. I'll finish with versions of the patch and test targeted at v14-16 and propose those before committing this. - Melanie
Attachment
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Mon, Jul 8, 2024 at 2:25 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > Attached v3 has one important additional component in the test -- I > use pg_stat_progress_vacuum to confirm that we actually do more than > one pass of index vacuuming. Otherwise, it would have been trivial for > the test to incorrectly pass. That's a good idea. > I could still use another pair of eyes on the test (looking out for > stability enhancing measures I could take). First, the basics: I found that your test failed reliably without your fix, and passed reliably with your fix. Next, performance: the total test runtime (as indicated by "time meson test -q recovery/043_vacuum_horizon_floor") was comparable to other recovery/* TAP tests. The new vacuum_horizon_floor test is a little on the long running side, as these tests go, but I think that's fine. Minor nitpicking about the comments in your TAP test: * It is necessary but not sufficient for your test to "skewer" maybe_needed, relative to OldestXmin. Obviously, it is not sufficient because the test can only fail when VACUUM prunes a heap page after the backend's horizons have been "skewered" in this sense. Pruning is when we get stuck, and if there's no more pruning then there's no opportunity for VACUUM to get stuck. Perhaps this point should be noted directly in the comments. You could add a sentence immediately after the existing sentence "Then vacuum's first pass will continue and pruning...". This new sentence would then add commentary such as "Finally, vacuum's second pass over the heap...". * Perhaps you should point out that you're using VACUUM FREEZE for this because it'll force the backend to always get a cleanup lock. This is something you rely on to make the repro reliable, but that's it. In other words, point out to the reader that this bug has nothing to do with freezing; it just so happens to be convenient to use VACUUM FREEZE this way, due to implementation details. 
* The sentence "VACUUM proceeds with pruning and does a visibility check on each tuple..." describes the bug in terms of the current state of things on Postgres 17, but Postgres 17 hasn't been released just yet. Does that really make sense? If you want to describe the invariant that caused heap_pre_freeze_checks() to error-out on HEAD/Postgres 17, then the commit message of your fix seems like the right place for that. You could reference these errors in passing. The errors seem fairly incidental to the real problem, at least to me. I think that there is some chance that this test will break the build farm in whatever way, since there is a long history of VACUUM not quite behaving as expected with these sorts of tests. I think that you should commit the test case separately, first thing in the morning, and then keep an eye on the build farm for the rest of the day. I don't think that it's sensible to bend over backwards, just to avoid breaking the build farm in this way. Thanks for working on this -- Peter Geoghegan
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Mon, Jul 15, 2024 at 6:02 PM Peter Geoghegan <pg@bowt.ie> wrote: > > On Mon, Jul 8, 2024 at 2:25 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > I could still use another pair of eyes on the test (looking out for > > stability enhancing measures I could take). > > First, the basics: I found that your test failed reliably without your > fix, and passed reliably with your fix. Thanks for the review. > Minor nitpicking about the comments in your TAP test: > > * It is necessary but not sufficient for your test to "skewer" > maybe_needed, relative to OldestXmin. Obviously, it is not sufficient > because the test can only fail when VACUUM prunes a heap page after > the backend's horizons have been "skewered" in this sense. > > Pruning is when we get stuck, and if there's no more pruning then > there's no opportunity for VACUUM to get stuck. Perhaps this point > should be noted directly in the comments. You could add a sentence > immediately after the existing sentence "Then vacuum's first pass will > continue and pruning...". This new sentence would then add commentary > such as "Finally, vacuum's second pass over the heap...". I've added a description to the top of the test of the scenario required and then reworked the comment you are describing to try and make this more clear. > * Perhaps you should point out that you're using VACUUM FREEZE for > this because it'll force the backend to always get a cleanup lock. > This is something you rely on to make the repro reliable, but that's > it. > > In other words, point out to the reader that this bug has nothing to > do with freezing; it just so happens to be convenient to use VACUUM > FREEZE this way, due to implementation details. I've mentioned this in a comment. > * The sentence "VACUUM proceeds with pruning and does a visibility > check on each tuple..." describes the bug in terms of the current > state of things on Postgres 17, but Postgres 17 hasn't been released > just yet. 
Does that really make sense? In the patch targeted at master, I think it makes sense to describe the code as it is. In the backpatch versions, I reworked this comment to be correct for those versions. > If you want to describe the invariant that caused > heap_pre_freeze_checks() to error-out on HEAD/Postgres 17, then the > commit message of your fix seems like the right place for that. You > could reference these errors in passing. The errors seem fairly > incidental to the real problem, at least to me. The errors are mentioned in the fix commit message. > I think that there is some chance that this test will break the build > farm in whatever way, since there is a long history of VACUUM not > quite behaving as expected with these sorts of tests. I think that you > should commit the test case separately, first thing in the morning, > and then keep an eye on the build farm for the rest of the day. I > don't think that it's sensible to bend over backwards, just to avoid > breaking the build farm in this way. Sounds good. - Melanie
Attachment
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Wed, Jul 17, 2024 at 11:07 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Mon, Jul 15, 2024 at 6:02 PM Peter Geoghegan <pg@bowt.ie> wrote: > > > > I think that there is some chance that this test will break the build > > farm in whatever way, since there is a long history of VACUUM not > > quite behaving as expected with these sorts of tests. I think that you > > should commit the test case separately, first thing in the morning, > > and then keep an eye on the build farm for the rest of the day. I > > don't think that it's sensible to bend over backwards, just to avoid > > breaking the build farm in this way. > > Sounds good.

Hmm. So, I was just running all the versions through CI again, and I noticed that the test failed on master on CI on Linux - Debian Bookworm with Meson. (This passes locally for me and has passed on previous CI runs.)

[15:43:41.547] stderr:
[15:43:41.547] # poll_query_until timed out executing this query:
[15:43:41.547] #
[15:43:41.547] # SELECT index_vacuum_count > 0
[15:43:41.547] # FROM pg_stat_progress_vacuum
[15:43:41.547] # WHERE datname='test_db' AND relid::regclass = 'vac_horizon_floor_table'::regclass;
[15:43:41.547] #
[15:43:41.547] # expecting this output:
[15:43:41.547] # t
[15:43:41.547] # last actual query output:
[15:43:41.547] # f

We didn't end up doing two index vacuum passes. Because it doesn't repro locally for me, I can only assume that the conditions for forcing two index vacuuming passes in master just weren't met in this case. I'm unsurprised, as it is much harder since 17 to force two passes of index vacuuming. It seems like this might be as unstable as I feared. I could add more dead data. Or, I could just commit the test to the back branches before 17. What do you think? - Melanie
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Wed, Jul 17, 2024 at 12:07 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > We didn't end up doing two index vacuum passes. Because it doesn't > repro locally for me, I can only assume that the conditions for > forcing two index vacuuming passes in master just weren't met in this > case. I'm unsurprised, as it is much harder since 17 to force two > passes of index vacuuming. It seems like this might be as unstable as > I feared. I could add more dead data. Or, I could just commit the test > to the back branches before 17. What do you think? How much margin of error do you have, in terms of total number of dead_items? That is, have you whittled it down to the minimum possible threshold for 2 passes? Some logging with VACUUM VERBOSE (run on the ci instance) might be illuminating. -- Peter Geoghegan
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Wed, Jul 17, 2024 at 12:11 PM Peter Geoghegan <pg@bowt.ie> wrote: > > On Wed, Jul 17, 2024 at 12:07 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > We didn't end up doing two index vacuum passes. Because it doesn't > > repro locally for me, I can only assume that the conditions for > > forcing two index vacuuming passes in master just weren't met in this > > case. I'm unsurprised, as it is much harder since 17 to force two > > passes of index vacuuming. It seems like this might be as unstable as > > I feared. I could add more dead data. Or, I could just commit the test > > to the back branches before 17. What do you think? > > How much margin of error do you have, in terms of total number of > dead_items? That is, have you whittled it down to the minimum possible > threshold for 2 passes? When I run it on my machine with some added logging, the space taken by dead items is about 330 kB more than maintenance_work_mem (which is set to 1 MB). I could roughly double the excess by increasing the number of inserted tuples from 400000 to 600000. I'll do this. > Some logging with VACUUM VERBOSE (run on the ci instance) might be illuminating. VACUUM VERBOSE will only tell us the number of dead tuples and dead item identifiers, but not how much space they take up -- which is how we decide whether or not to do index vacuuming. - Melanie
Melanie Plageman <melanieplageman@gmail.com> writes: > When I run it on my machine with some added logging, the space taken > by dead items is about 330 kB more than maintenance_work_mem (which is > set to 1 MB). I could roughly double the excess by increasing the > number of inserted tuples from 400000 to 600000. I'll do this.

So, after about two days in the buildfarm, we have failure reports from this test on gull, mamba, mereswine, and copperhead. mamba is mine, and I was able to reproduce the failure in a manual run. The problem seems to be that the test simply takes too long and we hit the default 180-second timeout on one step or another. I was able to make it pass by dint of

$ export PG_TEST_TIMEOUT_DEFAULT=1800

However, the test then took 908 seconds:

$ time make installcheck PROVE_TESTS=t/043_vacuum_horizon_floor.pl
...
# +++ tap install-check in src/test/recovery +++
t/043_vacuum_horizon_floor.pl .. ok
All tests successful.
Files=1, Tests=3, 908 wallclock secs ( 0.17 usr 0.01 sys + 21.42 cusr 35.03 csys = 56.63 CPU)
Result: PASS
909.26 real 22.10 user 35.21 sys

This is even slower than the 027_stream_regress.pl test, which currently takes around 847 seconds on that machine.

mamba, gull, and mereswine are 32-bit machines, which aside from being old and slow suffer an immediate 2x size-of-test penalty:

>> # The TIDStore vacuum uses to store dead items is optimized for its target
>> # system. On a 32-bit system, our example requires twice as many pages with
>> # the same number of dead items per page to fill the TIDStore and trigger a
>> # second round of index vacuuming.
>> my $is_64bit = $node_primary->safe_psql($test_db,
>>     qq[SELECT typbyval FROM pg_type WHERE typname = 'int8';]);
>>
>> my $nrows = $is_64bit eq 't' ? 400000 : 800000;

copperhead is 64-bit but is nonetheless even slower than the other three, so the fact that it's also timing out isn't that surprising.
I do not think the answer to this is to nag the respective animal owners to raise PG_TEST_TIMEOUT_DEFAULT. IMV this test is simply not worth the cycles it takes, at least not for these machines. I'm not sure whether to propose reverting it entirely or just disabling it on 32-bit hardware. I don't think we'd lose anything meaningful in test coverage if we did the latter; but that won't be enough to make copperhead happy. I am also suspicious that we'll get bad news from other very slow animals such as dikkop. I wonder if there is a less expensive way to trigger the test situation than brute-forcing things with a large index. Maybe the injection point infrastructure could help? regards, tom lane
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Sun, Jul 21, 2024 at 12:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I do not think the answer to this is to nag the respective animal > owners to raise PG_TEST_TIMEOUT_DEFAULT. IMV this test is simply > not worth the cycles it takes, at least not for these machines. Can't we just move it to PG_TEST_EXTRA? Alongside the existing "xid_wraparound" test? We didn't even have basic coverage of multi-pass VACUUMs before now. This new test added that coverage. I think that it will pull its weight. There will always be a small number of extremely slow buildfarm animals. Optimizing for things like Raspberry pi animals with SD cards just doesn't seem like a good use of developer time. I really care about keeping the tests fast, but only on platforms that hackers actually use for their development work. -- Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes: > On Sun, Jul 21, 2024 at 12:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I do not think the answer to this is to nag the respective animal >> owners to raise PG_TEST_TIMEOUT_DEFAULT. IMV this test is simply >> not worth the cycles it takes, at least not for these machines. > Can't we just move it to PG_TEST_EXTRA? Alongside the existing > "xid_wraparound" test? Perhaps. xid_wraparound seems entirely too slow for what it's testing as well, if you ask me, and there's a concurrent thread about that test causing problems too. > There will always be a small number of extremely slow buildfarm > animals. Optimizing for things like Raspberry pi animals with SD cards > just doesn't seem like a good use of developer time. I really care > about keeping the tests fast, but only on platforms that hackers > actually use for their development work. I find this argument completely disingenuous. If a test is slow enough to cause timeout failures on slower machines, then it's also eating a disproportionate number of cycles in every other check-world run --- many of which have humans waiting for them to finish. Caring about the runtime of test cases is good for future-you not just obsolete buildfarm animals. I note also that the PG_TEST_EXTRA approach has caused xid_wraparound to get next-to-zero buildfarm coverage. If that test is actually capable of revealing problems, we're unlikely to find out under the status quo. regards, tom lane
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Sun, Jul 21, 2024 at 5:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > There will always be a small number of extremely slow buildfarm > > animals. Optimizing for things like Raspberry pi animals with SD cards > > just doesn't seem like a good use of developer time. I really care > > about keeping the tests fast, but only on platforms that hackers > > actually use for their development work. > > I find this argument completely disingenuous. Disingenuous? Really? > If a test is slow > enough to cause timeout failures on slower machines, then it's also > eating a disproportionate number of cycles in every other check-world > run --- many of which have humans waiting for them to finish. Caring > about the runtime of test cases is good for future-you not just > obsolete buildfarm animals. That's not necessarily true, though. I actually benchmarked this new test. I found that its runtime was a little on the long side as these recovery TAP tests go, but not to an excessive degree. It wasn't the slowest by any means. It's entirely possible that the new test would in fact be far slower than other comparable tests, were I to run a similar benchmark on something like a Raspberry pi with an SD card -- that would explain the apparent inconsistency here. Obviously Raspberry pi type hardware is expected to be much slower than the machine I use day to day, but that isn't the only thing that matters. A Raspberry pi can also have completely different performance characteristics to high quality workstation hardware. The CPU might be tolerably fast, while I/O is a huge bottleneck. > I note also that the PG_TEST_EXTRA approach has caused xid_wraparound > to get next-to-zero buildfarm coverage. If that test is actually > capable of revealing problems, we're unlikely to find out under the > status quo. I saw that. I think that there is significant value in providing a way for individual developers to test wraparound. 
Both by providing a TAP test, and providing the associated SQL callable C test functions. There is less value in testing it on every conceivable platform. -- Peter Geoghegan
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Sun, Jul 21, 2024 at 12:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Melanie Plageman <melanieplageman@gmail.com> writes: > > When I run it on my machine with some added logging, the space taken > > by dead items is about 330 kB more than maintenance_work_mem (which is > > set to 1 MB). I could roughly double the excess by increasing the > > number of inserted tuples from 400000 to 600000. I'll do this. > > So, after about two days in the buildfarm, we have failure reports > from this test on gull, mamba, mereswine, and copperhead. mamba > is mine, and I was able to reproduce the failure in a manual run. > The problem seems to be that the test simply takes too long and > we hit the default 180-second timeout on one step or another. > I was able to make it pass by dint of > > $ export PG_TEST_TIMEOUT_DEFAULT=1800 > > However, the test then took 908 seconds: Thanks for taking the time to do this. If the test failures can be fixed by increasing timeout, that means that at least multiple index vacuums are reliably triggered with that number of rows. Obviously we can't have a super slow, flakey test, but I was worried the test might fail on different platforms because somehow the row count was insufficient to cause multiple index vacuums on some platforms for some reason (due to adaptive radix tree size being dependent on many factors). > $ time make installcheck PROVE_TESTS=t/043_vacuum_horizon_floor.pl > ... > # +++ tap install-check in src/test/recovery +++ > t/043_vacuum_horizon_floor.pl .. ok > All tests successful. > Files=1, Tests=3, 908 wallclock secs ( 0.17 usr 0.01 sys + 21.42 cusr 35.03 csys = 56.63 CPU) > Result: PASS > 909.26 real 22.10 user 35.21 sys > > This is even slower than the 027_stream_regress.pl test, which > currently takes around 847 seconds on that machine. 
> > mamba, gull, and mereswine are 32-bit machines, which aside from > being old and slow suffer an immediate 2x size-of-test penalty: > > >> # The TIDStore vacuum uses to store dead items is optimized for its target > >> # system. On a 32-bit system, our example requires twice as many pages with > >> # the same number of dead items per page to fill the TIDStore and trigger a > >> # second round of index vacuuming. > >> my $is_64bit = $node_primary->safe_psql($test_db, > >> qq[SELECT typbyval FROM pg_type WHERE typname = 'int8';]); > >> > >> my $nrows = $is_64bit eq 't' ? 400000 : 800000; > > copperhead is 64-bit but is nonetheless even slower than the > other three, so the fact that it's also timing out isn't > that surprising. > > I do not think the answer to this is to nag the respective animal > owners to raise PG_TEST_TIMEOUT_DEFAULT. IMV this test is simply > not worth the cycles it takes, at least not for these machines. > I'm not sure whether to propose reverting it entirely or just > disabling it on 32-bit hardware. I don't think we'd lose anything > meaningful in test coverage if we did the latter; but that won't be > enough to make copperhead happy. I am also suspicious that we'll > get bad news from other very slow animals such as dikkop. I am happy to do what Peter suggests and move it to PG_TEST_EXTRA, to disable it for 32-bit, or to revert it. > I wonder if there is a less expensive way to trigger the test > situation than brute-forcing things with a large index. > Maybe the injection point infrastructure could help? The issue with an injection point is that we need more than having the vacuuming backend pause at a specific point: we need a refresh of GlobalVisState to be forced at that point. Even if the horizon moves backward on the primary, this backend won't notice unless it has to update its GlobalVisState -- which often happens when taking a new snapshot, but also happens explicitly at the end of index vacuuming. - Melanie
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Sun, Jul 21, 2024 at 5:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Peter Geoghegan <pg@bowt.ie> writes: > > On Sun, Jul 21, 2024 at 12:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> I do not think the answer to this is to nag the respective animal > >> owners to raise PG_TEST_TIMEOUT_DEFAULT. IMV this test is simply > >> not worth the cycles it takes, at least not for these machines. > > > Can't we just move it to PG_TEST_EXTRA? Alongside the existing > > "xid_wraparound" test? > > Perhaps. xid_wraparound seems entirely too slow for what it's > testing as well, if you ask me, and there's a concurrent thread > about that test causing problems too. > > > There will always be a small number of extremely slow buildfarm > > animals. Optimizing for things like Raspberry pi animals with SD cards > > just doesn't seem like a good use of developer time. I really care > > about keeping the tests fast, but only on platforms that hackers > > actually use for their development work. > > I find this argument completely disingenuous. If a test is slow > enough to cause timeout failures on slower machines, then it's also > eating a disproportionate number of cycles in every other check-world > run --- many of which have humans waiting for them to finish. Caring > about the runtime of test cases is good for future-you not just > obsolete buildfarm animals. > > I note also that the PG_TEST_EXTRA approach has caused xid_wraparound > to get next-to-zero buildfarm coverage. If that test is actually > capable of revealing problems, we're unlikely to find out under the > status quo. What is the argument for PG_TEST_EXTRA if it is not running on almost any buildfarm animals? Are some of those tests valuable for other reasons than being consistently automatically run (e.g. developer understanding of how a particular part of code works)? If they aren't being run, how do we know that they still work (as test infrastructure changes)? 
The recovery conflict test is skipped on 15, which means that backported perl test changes may break it without us knowing. I don't know if you mean that PG_TEST_EXTRA tests are never run or just seldom run. If they are never run on the buildfarm, then they could end up silently breaking too. - Melanie
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Sun, Jul 21, 2024 at 4:29 PM Peter Geoghegan <pg@bowt.ie> wrote: > > On Sun, Jul 21, 2024 at 12:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I do not think the answer to this is to nag the respective animal > > owners to raise PG_TEST_TIMEOUT_DEFAULT. IMV this test is simply > > not worth the cycles it takes, at least not for these machines. > > Can't we just move it to PG_TEST_EXTRA? Alongside the existing > "xid_wraparound" test? > > We didn't even have basic coverage of multi-pass VACUUMs before now. > This new test added that coverage. I think that it will pull its > weight. Andres has suggested in the past that we allow maintenance_work_mem be set to a lower value or introduce some kind of development GUC so that we can more easily test multiple pass index vacuuming. Do you think this would be worth it? - Melanie
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On 2024-Jul-22, Melanie Plageman wrote: > On Sun, Jul 21, 2024 at 5:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I note also that the PG_TEST_EXTRA approach has caused xid_wraparound > > to get next-to-zero buildfarm coverage. If that test is actually > > capable of revealing problems, we're unlikely to find out under the > > status quo. > > What is the argument for PG_TEST_EXTRA if it is not running on almost > any buildfarm animals? Are some of those tests valuable for other > reasons than being consistently automatically run (e.g. developer > understanding of how a particular part of code works)?

I think it's a bad idea to require buildfarm owners to edit their config files as we add tests that depend on PG_TEST_EXTRA. AFAIR we invented that setting so that tests that had security implications could be made opt-in instead of opt-out; I think this was a sensible thing to do, to avoid possibly compromising the machines in some way. But I think these new tests have a different problem, so we shouldn't use the same mechanism.

What about some brainstorming to improve this? For example: have something in the tree that lets committers opt some tests out from specific BF machines without having to poke at the BF machines. I imagine two files: one that carries tags for buildfarm members, something like the /etc/groups file,

src/test/tags.lst
  slow: gull,mamba,mereswine,copperhead

and another file that lists tests to skip on members that have certain tags,

src/tools/buildfarm/do_not_run.lst
  slow:src/test/modules/xid_wraparound
  slow:src/test/recovery/t/043_vacuum_horizon_floor.pl

so that run_build.pl knows that if the current member has the tag slow, then these two tests are to be skipped. Then we can have xid_wraparound enabled generally (without requiring PG_TEST_EXTRA), and the BF client knows not to run it in the particular cases where it's not wanted.
This proposal has a number of problems (a glaring one being the maintenance of the list of members per tag), but maybe it inspires better ideas. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Siempre hay que alimentar a los dioses, aunque la tierra esté seca" (Orual)
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Mon, Jul 22, 2024 at 9:32 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > Andres has suggested in the past that we allow maintenance_work_mem be > set to a lower value or introduce some kind of development GUC so that > we can more easily test multiple pass index vacuuming. Do you think > this would be worth it? No, I don't. -- Peter Geoghegan
Melanie Plageman <melanieplageman@gmail.com> writes: > On Sun, Jul 21, 2024 at 5:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I note also that the PG_TEST_EXTRA approach has caused xid_wraparound >> to get next-to-zero buildfarm coverage. If that test is actually >> capable of revealing problems, we're unlikely to find out under the >> status quo. > What is the argument for PG_TEST_EXTRA if it is not running on almost > any buildfarm animals? Are some of those tests valuable for other > reasons than being consistently automatically run (e.g. developer > understanding of how a particular part of code works)? The point of PG_TEST_EXTRA is to make some of the tests be opt-in. Originally it was just used for tests that might have security implications (e.g. the kerberos tests, which involve running a not-terribly-locked-down kerberos server). I'm a little suspicious of using it for tests that merely take an unreasonable amount of time --- to me, that indicates laziness on the part of the test author. It'd be better to get the test runtime down to the point where it's reasonable to expect all the buildfarm animals to run it. As an example, we're not getting any Valgrind coverage on xid_wraparound, and we won't ever get it with the current approach, which I find bad. > I don't know if you mean that PG_TEST_EXTRA tests are never > run or just seldom run. If they are never run on the buildfarm, then > they could end up silently breaking too. They are opt-in, meaning that both buildfarm owners and regular developers have to take extra action (i.e. set the PG_TEST_EXTRA environment variable) to run them. There's a reasonable number of animals opting into ssl, kerberos, etc, but I see only two that are opting into xid_wraparound. If we change this new test to be conditional on PG_TEST_EXTRA, it won't get run unless you successfully nag some buildfarm owners to run it. regards, tom lane
Peter Geoghegan <pg@bowt.ie> writes: > On Mon, Jul 22, 2024 at 9:32 AM Melanie Plageman > <melanieplageman@gmail.com> wrote: >> Andres has suggested in the past that we allow maintenance_work_mem be >> set to a lower value or introduce some kind of development GUC so that >> we can more easily test multiple pass index vacuuming. Do you think >> this would be worth it? > No, I don't. I don't see why that's not a good idea. regards, tom lane
Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > I think it's a bad idea to require buildfarm owners to edit their config > files as we add tests that depend on PG_TEST_EXTRA. AFAIR we invented > that setting so that tests that had security implications could be made > opt-in instead of opt-out; I think this was a sensible thing to do, to > avoid possibly compromising the machines in some way. But I think these > new tests have a different problem, so we shouldn't use the same > mechanism. That's my feeling also. > What about some brainstorming to improve this? > For example: have something in the tree that lets committers opt some > tests out from specific BF machines without having to poke at the BF > machines. I imagine two files: one that carries tags for buildfarm > members, something like the /etc/groups file, I'd turn it around, and provide some way for buildfarm owners to say "this machine is slow". Maybe make the tests respond to the presence of an environment variable PG_TEST_SKIP_SLOW, or some such thing. That particular solution would require no new infrastructure (such as a new buildfarm client release); it'd only require editing the config files of affected animals. regards, tom lane
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Mon, Jul 22, 2024 at 11:49 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> Andres has suggested in the past that we allow maintenance_work_mem be > >> set to a lower value or introduce some kind of development GUC so that > >> we can more easily test multiple pass index vacuuming. Do you think > >> this would be worth it? > > > No, I don't. > > I don't see why that's not a good idea. I don't think that it's worth going to that trouble. Testing multiple passes isn't hard -- not in any real practical sense. I accept that there needs to be some solution to the problem of the tests timing out on slow running buildfarm animals. Your PG_TEST_SKIP_SLOW proposal seems like a good approach. -- Peter Geoghegan
Hi, On 2024-07-21 12:51:51 -0400, Tom Lane wrote: > Melanie Plageman <melanieplageman@gmail.com> writes: > > When I run it on my machine with some added logging, the space taken > > by dead items is about 330 kB more than maintenance_work_mem (which is > > set to 1 MB). I could roughly double the excess by increasing the > > number of inserted tuples from 400000 to 600000. I'll do this. > mamba, gull, and mereswine are 32-bit machines, which aside from > being old and slow suffer an immediate 2x size-of-test penalty: I think what we ought to do here is to lower the lower limit for memory usage for vacuum. With the new state in 17+ it basically has become impossible to test multi-pass vacuums in a way that won't get your test thrown out - that's bad. > I do not think the answer to this is to nag the respective animal owners to > raise PG_TEST_TIMEOUT_DEFAULT. IMV this test is simply not worth the cycles > it takes, at least not for these machines. This specific area of the code has a *long* history of bugs, I'd be very loath to give up testing. Greetings, Andres Freund
On Mon, Jul 22, 2024 at 11:48 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I'm a little suspicious > of using it for tests that merely take an unreasonable amount of > time --- to me, that indicates laziness on the part of the test > author. Laziness would have been not bothering to develop a TAP test for this at all. Going to the trouble of creating one and not being able to make it as fast or as stable as everybody would like is just being human. I never quite know what to do about TAP testing for issues like this. Ideally, we want a test case that runs quickly, is highly stable, is perfectly sensitive to the bug being fixed, and has a reasonable likelihood of being sensitive to future bugs of the same ilk. But such a test case need not exist, and even if it does, it need not be the case that any of us are able to find it. Or maybe finding it is possible but will take an unreasonable amount of time: if it took a committer six months to come up with such a test case for this bug, would that be worth it, or just overkill? I'd say overkill: I'd rather have that committer working on other stuff than spending six months trying to craft the perfect test case for a bug that's already fixed. Also, this particular bug seems to require a very specific combination of circumstances in order to trigger it. So the test gets complicated. As mentioned, that makes it harder to get the test case fast and stable, but it also reduces the chances that the test case will ever find anything. I don't think that this will be the last time we make a mistake around VACUUM's xmin handling, but the next mistake may well require an equally baroque but *different* setup to cause a problem. I hate to come to the conclusion that we just shouldn't test for this, but I don't think it's fair to send Melanie off on a wild goose chase looking for a perfect test case that may not realistically exist, either. -- Robert Haas EDB: http://www.enterprisedb.com
On 2024-07-22 13:16:49 -0400, Robert Haas wrote: > On Mon, Jul 22, 2024 at 11:48 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I'm a little suspicious > > of using it for tests that merely take an unreasonable amount of > > time --- to me, that indicates laziness on the part of the test > > author. > > Laziness would have been not bothering to develop a TAP test for this > at all. Going to the trouble of creating one and not being able to > make it as fast or as stable as everybody would like is just being > human. Yea, I think calling weeks of effort by Melanie lazy is, uhm, not kind. It's not like somebody else had a great suggestion for how to do this in a better way either.
On 2024-07-22 12:00:51 -0400, Peter Geoghegan wrote: > On Mon, Jul 22, 2024 at 11:49 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > >> Andres has suggested in the past that we allow maintenance_work_mem be > > >> set to a lower value or introduce some kind of development GUC so that > > >> we can more easily test multiple pass index vacuuming. Do you think > > >> this would be worth it? > > > > > No, I don't. > > > > I don't see why that's not a good idea. > > I don't think that it's worth going to that trouble. Testing multiple > passes isn't hard -- not in any real practical sense. It's hard by now (i.e. 17+) because you need substantial amounts of rows to be able to trigger it which makes it a hard fight to introduce. And the cost of setting the GUC limit lower is essentially zero. What's the point of having such a high lower limit? Greetings, Andres Freund
On Mon, Jul 22, 2024 at 1:17 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Jul 22, 2024 at 11:48 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I'm a little suspicious > > of using it for tests that merely take an unreasonable amount of > > time --- to me, that indicates laziness on the part of the test > > author. > > Laziness would have been not bothering to develop a TAP test for this > at all. Going to the trouble of creating one and not being able to > make it as fast or as stable as everybody would like is just being > human. > > I never quite know what to do about TAP testing for issues like this. > Ideally, we want a test case that runs quickly, is highly stable, is > perfectly sensitive to the bug being fixed, and has a reasonable > likelihood of being sensitive to future bugs of the same ilk. But such > a test case need not exist, and even if it does, it need not be the > case that any of us are able to find it. Or maybe finding it is > possible but will take an unreasonable amount of time: if it took a > committer six months to come up with such a test case for this bug, > would that be worth it, or just overkill? I'd say overkill: I'd rather > have that committer working on other stuff than spending six months > trying to craft the perfect test case for a bug that's already fixed. > > Also, this particular bug seems to require a very specific combination > of circumstances in order to trigger it. So the test gets complicated. > As mentioned, that makes it harder to get the test case fast and > stable, but it also reduces the chances that the test case will ever > find anything. I don't think that this will be the last time we make a > mistake around VACUUM's xmin handling, but the next mistake may well > require an equally baroque but *different* setup to cause a problem. 
> I hate to come to the conclusion that we just shouldn't test for this, > but I don't think it's fair to send Melanie off on a wild goose chase > looking for a perfect test case that may not realistically exist, > either. So, I've just gone through all the test failures on master and 17 for mamba, gull, mereswine, and copperhead. I wanted to confirm that the test was always failing for the same reason and also if it had any failures pre-TIDStore. We've only run tests with this commit on some of the back branches for some of these animals. Of those, I don't see any failures so far. So, it seems the test instability is just related to trying to get multiple passes of index vacuuming reliably with TIDStore. AFAICT, all the 32-bit machine failures are timeouts waiting for the standby to catch up (mamba, gull, mereswine). Unfortunately, the failures on copperhead (a 64-bit machine) are because we don't actually succeed in triggering a second vacuum pass. This would not be fixed by a longer timeout. Because of this, I'm inclined to revert the test on 17 and master to avoid distracting folks committing other work and seeing those animals go red. I wonder if Sawada-san or John have a test case minimally reproducing a case needing multiple index vacuuming rounds. You can't do it with my example and just more dead rows per page. If you just increase the number of dead tuples, it doesn't increase the size of the TIDStore unless those dead tuples are at different offsets. And I couldn't find DDL which would cause the TIDStore to be > 1MB without using a low fill-factor and many rows. Additionally, the fact that the same number of rows does not trigger the multiple passes on two different 64-bit machines worries me and makes me think that we will struggle to trigger these conditions without overshooting the minimum by quite a bit. - Melanie
On Mon, Jul 22, 2024 at 2:13 PM Andres Freund <andres@anarazel.de> wrote: > It's hard by now (i.e. 17+) because you need substantial amounts of rows to be > able to trigger it which makes it a hard fight to introduce. I didn't think that it was particularly hard when I tested the test that Melanie committed. > And the cost of > setting the GUC limit lower is essentially zero. Apparently you know more about TID Store than me. If it really is trivial to lower the limit, then I have no objections to doing so. That would make it easy to fix the test flappiness issues by just using the much lower limit. -- Peter Geoghegan
On Mon, Jul 22, 2024 at 2:17 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > So, I've just gone through all the test failures on master and 17 for > mamba, gull, mereswine, and copperhead. I wanted to confirm that the > test was always failing for the same reason and also if it had any > failures pre-TIDStore. > > We've only run tests with this commit on some of the back branches for > some of these animals. Of those, I don't see any failures so far. So, > it seems the test instability is just related to trying to get > multiple passes of index vacuuming reliably with TIDStore. > > AFAICT, all the 32bit machine failures are timeouts waiting for the > standby to catch up (mamba, gull, merswine). Unfortunately, the > failures on copperhead (a 64 bit machine) are because we don't > actually succeed in triggering a second vacuum pass. This would not be > fixed by a longer timeout. > > Because of this, I'm inclined to revert the test on 17 and master to > avoid distracting folks committing other work and seeing those animals > go red. Okay, I reverted this for now on 17 and master. Adding Sawada-san to the thread to see if he has any ideas for a smaller two-round index vacuum example. - Melanie
Melanie Plageman <melanieplageman@gmail.com> writes: > We've only run tests with this commit on some of the back branches for > some of these animals. Of those, I don't see any failures so far. So, > it seems the test instability is just related to trying to get > multiple passes of index vacuuming reliably with TIDStore. > AFAICT, all the 32bit machine failures are timeouts waiting for the > standby to catch up (mamba, gull, merswine). Unfortunately, the > failures on copperhead (a 64 bit machine) are because we don't > actually succeed in triggering a second vacuum pass. This would not be > fixed by a longer timeout. Ouch. This seems to me to raise the importance of getting a better way to test multiple-index-vacuum-passes. Peter argued upthread that we don't need a better way, but I don't see how that argument holds water if copperhead was not reaching it despite being 64-bit. (Did you figure out exactly why it doesn't reach the code?) > Because of this, I'm inclined to revert the test on 17 and master to > avoid distracting folks committing other work and seeing those animals > go red. Agreed as a short-term measure. regards, tom lane
On Mon, Jul 22, 2024 at 10:37:17AM -0700, Andres Freund wrote: > On 2024-07-22 13:16:49 -0400, Robert Haas wrote: >> Laziness would have been not bothering to develop a TAP test for this >> at all. Going to the trouble of creating one and not being able to >> make it as fast or as stable as everybody would like is just being >> human. > > Yea, I think calling weeks of effort by Melanie lazy is, uhm, not kind. FWIW, I'm really impressed by what Melanie has achieved here, fixing a hard bug while hacking a crazily-complicated test to make it reproducible. This has an incredible amount of value in the long run. > It's not like somebody else had a great suggestion for how to do this in a > better way either. Sawada-san and John are the two people in the best position to answer that. I'm not sure how to force a second index pass, either. Hmm. An idea would be to manipulate the TIDStore stack under the injection points switch? This is run by the CI and some buildfarm members already. -- Michael
Michael Paquier <michael@paquier.xyz> writes: > Sawada-san and John are the two ones in the best position to answer > that. I'm not sure either how to force a second index pass, either. Yeah, I think we've established that having some way to force that, without using a huge test case, would be really desirable. Maybe just provide a way to put an artificial limit on how many tuples processed per pass? (And no, I wasn't trying to rag on Melanie. My point here is that we've failed to design-in easy testability of this code path, and that's surely not her fault.) regards, tom lane
On Mon, Jul 22, 2024 at 2:04 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Mon, Jul 22, 2024 at 2:17 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > > > So, I've just gone through all the test failures on master and 17 for > > mamba, gull, mereswine, and copperhead. I wanted to confirm that the > > test was always failing for the same reason and also if it had any > > failures pre-TIDStore. > > > > We've only run tests with this commit on some of the back branches for > > some of these animals. Of those, I don't see any failures so far. So, > > it seems the test instability is just related to trying to get > > multiple passes of index vacuuming reliably with TIDStore. > > > > AFAICT, all the 32bit machine failures are timeouts waiting for the > > standby to catch up (mamba, gull, merswine). Unfortunately, the > > failures on copperhead (a 64 bit machine) are because we don't > > actually succeed in triggering a second vacuum pass. This would not be > > fixed by a longer timeout. > > > > Because of this, I'm inclined to revert the test on 17 and master to > > avoid distracting folks committing other work and seeing those animals > > go red. > > Okay, I reverted this for now on 17 and master. Adding Sawada-san to > the thread to see if he has any ideas for a smaller two-round index > vacuum example.

+ CREATE TABLE ${table1}(col1 int)
+ WITH (autovacuum_enabled=false, fillfactor=10);
+ INSERT INTO $table1 VALUES(7);
+ INSERT INTO $table1 SELECT generate_series(1, $nrows) % 3;
+ CREATE INDEX on ${table1}(col1);
+ UPDATE $table1 SET col1 = 3 WHERE col1 = 0;
+ INSERT INTO $table1 VALUES(7);

These queries make sense to me; these make the radix tree wide and use more nodes, instead of fattening leaf nodes (i.e. the offset bitmap).
The $table1 has 18182 blocks, and the radix tree statistics show:

max_val = 65535
num_keys = 18182
height = 1, n4 = 0, n16 = 1, n32 = 0, n64 = 0, n256 = 72, leaves = 18182

which means the tree has two levels and we use the maximum-size node for every node except one. I don't have any great idea for substantially reducing the total number of tuples in $table1. We could probably use DELETE instead of UPDATE to make garbage tuples (although I'm not sure that's okay for this test), which reduces the amount of WAL records from 11MB to 4MB and would reduce the time to catch up. But I'm not sure how much it would help. There might be ways to trigger a two-round index vacuum with fewer tuples, but if the tests are too optimized for the current TidStore, we will have to re-adjust them if the TidStore changes in the future. So I think it's better and more reliable to allow maintenance_work_mem to be set to a lower value, or to use injection points somehow. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
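Those statistics can be sanity-checked arithmetically. The sketch below is an illustration only: the 256-way fanout and 8-bit key chunks are assumed from how the adaptive radix tree is described in this thread, not read out of the TidStore source.

```python
import math

# Statistics reported for $table1 above.
num_keys = 18182   # one key (heap block number) per leaf
fanout = 256       # assumed maximum fanout of a radix tree node

# With 8-bit key chunks, two levels address fanout**2 = 65536 keys,
# matching the reported max_val = 65535 and covering all 18182 blocks.
assert num_keys <= fanout ** 2
assert 65535 == fanout ** 2 - 1

# The bottom level needs ceil(18182 / 256) nodes to hold all the
# leaf pointers, consistent with the reported node counts.
bottom_nodes = math.ceil(num_keys / fanout)
print(bottom_nodes)  # 72
```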
On Mon, Jul 22, 2024 at 6:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Melanie Plageman <melanieplageman@gmail.com> writes: > > We've only run tests with this commit on some of the back branches for > > some of these animals. Of those, I don't see any failures so far. So, > > it seems the test instability is just related to trying to get > > multiple passes of index vacuuming reliably with TIDStore. > > > AFAICT, all the 32bit machine failures are timeouts waiting for the > > standby to catch up (mamba, gull, merswine). Unfortunately, the > > failures on copperhead (a 64 bit machine) are because we don't > > actually succeed in triggering a second vacuum pass. This would not be > > fixed by a longer timeout. > > Ouch. This seems to me to raise the importance of getting a better > way to test multiple-index-vacuum-passes. Peter argued upthread > that we don't need a better way, but I don't see how that argument > holds water if copperhead was not reaching it despite being 64-bit. > (Did you figure out exactly why it doesn't reach the code?) I wasn't able to reproduce the failure (failing to do > 1 index vacuum pass) on my local machine (which is 64-bit) without decreasing the number of tuples inserted. The copperhead failure confuses me because the speed of the machine should *not* affect how much space the dead item TIDStore takes up. I would have bet money that the same number and offsets of dead tuples per page in a relation would take up the same amount of space in a TIDStore on any 64-bit system -- regardless of how slowly it runs vacuum. Here is some background on how I came up with the DDL and tuple count for the test: TIDStore uses 32 BITS_PER_BITMAPWORD on 32-bit systems and 64 on 64-bit systems. So, if you only have one bitmapword's worth of dead items per page, it was easy to figure out that you would need double the number of pages with dead items to take up the same amount of TIDStore space on a 32-bit system as on a 64-bit system.
I wanted to figure out how to take up double the amount of TIDStore space *without* doubling the number of tuples. This is not straightforward. You can't just delete twice as many dead tuples per page. For starters, you can compactly represent many dead tuples in a single bitmapword. Outside of this, there seems to be some effect on the amount of space the adaptive radix tree takes up if the dead items on the pages are at the same offsets on all the pages. I thought this might have to do with being able to use the same chunk (in ART terms)? I spent some time trying to figure it out, but I gave up once I got confused enough to try and read the adaptive radix tree paper. I found myself wishing there was some way to visualize the TIDStore. I don't have good ideas how to represent this, but if we found one, we could add a function to the test_tidstore module. I also think it would be useful to have peak TIDStore usage in bytes in the vacuum verbose output. I had it on my list to propose something like this after I hacked together a version myself while trying to debug the test locally. - Melanie
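The bitmapword accounting described above can be sketched numerically. This is an illustration only; the real TidStore layout has per-entry headers and node overheads that this deliberately ignores.

```python
def bitmapwords_needed(max_offset, bits_per_bitmapword):
    """Words needed to cover dead-item offsets up to max_offset
    on one heap page, at one bit per offset."""
    return -(-max_offset // bits_per_bitmapword)  # ceiling division

# One word covers 64 offsets on a 64-bit build but only 32 on a
# 32-bit build, so the same dead items can cost twice as many words.
assert bitmapwords_needed(60, 64) == 1
assert bitmapwords_needed(60, 32) == 2

# Adding more dead tuples at offsets already covered by the first
# word does not grow the bitmap at all -- which is why "just more
# dead rows per page" doesn't inflate the TIDStore.
assert bitmapwords_needed(40, 64) == bitmapwords_needed(60, 64)
```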
On Mon, Jul 22, 2024 at 6:26 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Mon, Jul 22, 2024 at 6:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > Melanie Plageman <melanieplageman@gmail.com> writes: > > > We've only run tests with this commit on some of the back branches for > > > some of these animals. Of those, I don't see any failures so far. So, > > > it seems the test instability is just related to trying to get > > > multiple passes of index vacuuming reliably with TIDStore. > > > > > AFAICT, all the 32bit machine failures are timeouts waiting for the > > > standby to catch up (mamba, gull, merswine). Unfortunately, the > > > failures on copperhead (a 64 bit machine) are because we don't > > > actually succeed in triggering a second vacuum pass. This would not be > > > fixed by a longer timeout. > > > > Ouch. This seems to me to raise the importance of getting a better > > way to test multiple-index-vacuum-passes. Peter argued upthread > > that we don't need a better way, but I don't see how that argument > > holds water if copperhead was not reaching it despite being 64-bit. > > (Did you figure out exactly why it doesn't reach the code?) > > I wasn't able to reproduce the failure (failing to do > 1 index vacuum > pass) on my local machine (which is 64 bit) without decreasing the > number of tuples inserted. The copperhead failure confuses me because > the speed of the machine should *not* affect how much space the dead > item TIDStore takes up. I would have bet money that the same number > and offsets of dead tuples per page in a relation would take up the > same amount of space in a TIDStore on any 64-bit system -- regardless > of how slowly it runs vacuum. Looking at copperhead's failure logs, I could not find that "VACUUM (VERBOSE, FREEZE) vac_horizon_floor_table;" wrote the number of index scans in logs. Is there any clue that made you think the test failed to do multiple index vacuum passes? 
Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Mon, Jul 22, 2024 at 10:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Mon, Jul 22, 2024 at 6:26 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > > > On Mon, Jul 22, 2024 at 6:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > > > Melanie Plageman <melanieplageman@gmail.com> writes: > > > > We've only run tests with this commit on some of the back branches for > > > > some of these animals. Of those, I don't see any failures so far. So, > > > > it seems the test instability is just related to trying to get > > > > multiple passes of index vacuuming reliably with TIDStore. > > > > > > > AFAICT, all the 32bit machine failures are timeouts waiting for the > > > > standby to catch up (mamba, gull, merswine). Unfortunately, the > > > > failures on copperhead (a 64 bit machine) are because we don't > > > > actually succeed in triggering a second vacuum pass. This would not be > > > > fixed by a longer timeout. > > > > > > Ouch. This seems to me to raise the importance of getting a better > > > way to test multiple-index-vacuum-passes. Peter argued upthread > > > that we don't need a better way, but I don't see how that argument > > > holds water if copperhead was not reaching it despite being 64-bit. > > > (Did you figure out exactly why it doesn't reach the code?) > > > > I wasn't able to reproduce the failure (failing to do > 1 index vacuum > > pass) on my local machine (which is 64 bit) without decreasing the > > number of tuples inserted. The copperhead failure confuses me because > > the speed of the machine should *not* affect how much space the dead > > item TIDStore takes up. I would have bet money that the same number > > and offsets of dead tuples per page in a relation would take up the > > same amount of space in a TIDStore on any 64-bit system -- regardless > > of how slowly it runs vacuum. 
> > Looking at copperhead's failure logs, I could not find that "VACUUM > (VERBOSE, FREEZE) vac_horizon_floor_table;" wrote the number of index > scans in logs. Is there any clue that made you think the test failed > to do multiple index vacuum passes? The vacuum doesn't actually finish because I have a cursor that keeps it from finishing and then I query pg_stat_progress_vacuum after the first index vacuuming round should have happened and it did not do the index vacuum:

[20:39:34.645](351.522s) # poll_query_until timed out executing this query:
#
# SELECT index_vacuum_count > 0
# FROM pg_stat_progress_vacuum
# WHERE datname='test_db' AND relid::regclass = 'vac_horizon_floor_table'::regclass;
#
# expecting this output:
# t
# last actual query output:
# f

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2024-07-22%2015%3A00%3A11

I suppose it is possible that it did in fact time out and the index vacuum was still in progress. But most of the other "too slow" failures were when the standby was trying to catch up. Usually the pg_stat_progress_vacuum test fails because we didn't actually do that index vacuuming round yet. - Melanie
On Tue, Jul 23, 2024 at 5:43 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Mon, Jul 22, 2024 at 10:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Mon, Jul 22, 2024 at 6:26 PM Melanie Plageman > > <melanieplageman@gmail.com> wrote: > > > > > > On Mon, Jul 22, 2024 at 6:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > > > > > Melanie Plageman <melanieplageman@gmail.com> writes: > > > > > We've only run tests with this commit on some of the back branches for > > > > > some of these animals. Of those, I don't see any failures so far. So, > > > > > it seems the test instability is just related to trying to get > > > > > multiple passes of index vacuuming reliably with TIDStore. > > > > > > > > > AFAICT, all the 32bit machine failures are timeouts waiting for the > > > > > standby to catch up (mamba, gull, merswine). Unfortunately, the > > > > > failures on copperhead (a 64 bit machine) are because we don't > > > > > actually succeed in triggering a second vacuum pass. This would not be > > > > > fixed by a longer timeout. > > > > > > > > Ouch. This seems to me to raise the importance of getting a better > > > > way to test multiple-index-vacuum-passes. Peter argued upthread > > > > that we don't need a better way, but I don't see how that argument > > > > holds water if copperhead was not reaching it despite being 64-bit. > > > > (Did you figure out exactly why it doesn't reach the code?) > > > > > > I wasn't able to reproduce the failure (failing to do > 1 index vacuum > > > pass) on my local machine (which is 64 bit) without decreasing the > > > number of tuples inserted. The copperhead failure confuses me because > > > the speed of the machine should *not* affect how much space the dead > > > item TIDStore takes up. 
> > > I would have bet money that the same number > > > and offsets of dead tuples per page in a relation would take up the > > > same amount of space in a TIDStore on any 64-bit system -- regardless > > > of how slowly it runs vacuum. > > > > Looking at copperhead's failure logs, I could not find that "VACUUM > > (VERBOSE, FREEZE) vac_horizon_floor_table;" wrote the number of index > > scans in logs. Is there any clue that made you think the test failed > > to do multiple index vacuum passes? > > The vacuum doesn't actually finish because I have a cursor that keeps > it from finishing and then I query pg_stat_progress_vacuum after the > first index vacuuming round should have happened and it did not do the > index vacuum: > > [20:39:34.645](351.522s) # poll_query_until timed out executing this query: > # > # SELECT index_vacuum_count > 0 > # FROM pg_stat_progress_vacuum > # WHERE datname='test_db' AND relid::regclass = > 'vac_horizon_floor_table'::regclass; > # > # expecting this output: > # t > # last actual query output: > # f > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2024-07-22%2015%3A00%3A11 > > I suppose it is possible that it did in fact time out and the index > vacuum was still in progress. But most of the other "too slow" > failures were when the standby was trying to catch up. Usually the > pg_stat_progress_vacuum test fails because we didn't actually do that > index vacuuming round yet. Thank you for your explanation! I understood the test cases. I figured out why two-round index vacuum was not triggered on copperhead although it's a 64-bit system. In short, this test case depends on MEMORY_CONTEXT_CHECKING (i.e. USE_ASSERT_CHECKING) being on. In this test case, every BlocktableEntry size would be 16 bytes; the header is 8 bytes and the offset bitmap is 8 bytes (covering up to offset 63).
We calculate the memory size (required_size in BumpAlloc()) to allocate in a bump memory context as follows:

#ifdef MEMORY_CONTEXT_CHECKING
    /* ensure there's always space for the sentinel byte */
    chunk_size = MAXALIGN(size + 1);
#else
    chunk_size = MAXALIGN(size);
#endif
(snip)
    required_size = chunk_size + Bump_CHUNKHDRSZ;

Without MEMORY_CONTEXT_CHECKING, if size is 16 bytes, required_size is also 16 bytes, as it's already 8-byte aligned and Bump_CHUNKHDRSZ is 0. On the other hand, with MEMORY_CONTEXT_CHECKING, the required_size is bumped to 40 bytes, as chunk_size is 24 bytes and Bump_CHUNKHDRSZ is 16 bytes. Therefore, with MEMORY_CONTEXT_CHECKING, we allocate more memory and use more Bump memory blocks, resulting in filling up the TidStore in the test cases. We can easily reproduce this test failure with a PostgreSQL server built without --enable-cassert. It seems that copperhead is the sole BF animal that doesn't use --enable-cassert but runs recovery-check. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
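The arithmetic above can be checked with a small sketch. The 8-byte MAXALIGN boundary and the 16-byte chunk header under MEMORY_CONTEXT_CHECKING are taken from the message, not independently verified against the allocator source.

```python
MAXIMUM_ALIGNOF = 8  # assumed MAXALIGN boundary

def maxalign(size):
    # Round up to the next multiple of MAXIMUM_ALIGNOF.
    return (size + MAXIMUM_ALIGNOF - 1) & ~(MAXIMUM_ALIGNOF - 1)

def bump_required_size(size, checking, chunk_hdr_size):
    # Mirrors the BumpAlloc() fragment quoted above: with memory
    # context checking, one sentinel byte is reserved before
    # rounding up, and a per-chunk header is added.
    chunk_size = maxalign(size + 1) if checking else maxalign(size)
    return chunk_size + chunk_hdr_size

# A 16-byte BlocktableEntry:
assert bump_required_size(16, False, 0) == 16   # non-cassert build
assert bump_required_size(16, True, 16) == 40   # cassert build
```

So an assert-enabled build spends 2.5x the memory per entry here, which is why copperhead (built without --enable-cassert) never filled the TidStore.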
On Wed, Jul 24, 2024 at 5:40 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> Without MEMORY_CONTEXT_CHECKING, if size is 16 bytes, required_size is
> also 16 bytes as it's already 8-byte aligned and Bump_CHUNKHDRSZ is 0.
> On the other hand, with MEMORY_CONTEXT_CHECKING, required_size is
> bumped to 40 bytes as chunk_size is 24 bytes and Bump_CHUNKHDRSZ is 16
> bytes. Therefore, with MEMORY_CONTEXT_CHECKING, we allocate more memory
> and use more bump memory blocks, resulting in filling up the TidStore
> in the test cases. We can easily reproduce this test failure with a
> PostgreSQL server built without --enable-cassert. It seems that
> copperhead is the sole BF animal that doesn't use --enable-cassert but
> runs recovery-check.

It seems we could force the bitmaps to be larger, and also reduce the
number of updated tuples by updating only the last few tuples (say
5-10) by looking at the ctid's offset. This requires some trickery,
but I believe I've done it in the past by casting to text and
extracting with a regex. (I'm assuming the number of tuples updated is
more important than the number of tuples inserted on a newly created
table.)

As for lowering the limit, we've experimented with 256kB here:

https://www.postgresql.org/message-id/CANWCAZZUTvZ3LsYpauYQVzcEZXZ7Qe+9ntnHgYZDTWxPuL++zA@mail.gmail.com

As I mention there, going lower than that would need a small amount of
reorganization in the radix tree. Not difficult -- the thing I'm
concerned about is that we'd likely need to document a separate
minimum for DSA, since that behaves strangely with 256kB and might not
work at all lower than that.
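[Editor's note: a sketch of the "trickery" described above, with made-up
values -- a ctid rendered as text looks like '(block,offset)', so the
offset number can be pulled out with a regex and compared:]

```python
import re

# Illustrative only: extract the offset part of a ctid shown as text,
# so that only the last few tuples on each heap page are selected.
def ctid_offset(ctid_text):
    # ctid text form is '(block,offset)'; capture the trailing number
    return int(re.search(r',(\d+)\)$', ctid_text).group(1))

ctids = ['(0,3)', '(0,57)', '(12,5)', '(12,58)']
print([c for c in ctids if ctid_offset(c) > 55])  # ['(0,57)', '(12,58)']
```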
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Wed, Jul 24, 2024 at 2:42 PM John Naylor <johncnaylorls@gmail.com> wrote:

> As for lowering the limit, we've experimented with 256kB here:
>
> https://www.postgresql.org/message-id/CANWCAZZUTvZ3LsYpauYQVzcEZXZ7Qe+9ntnHgYZDTWxPuL++zA@mail.gmail.com
>
> As I mention there, going lower than that would need a small amount of
> reorganization in the radix tree. Not difficult -- the thing I'm
> concerned about is that we'd likely need to document a separate
> minimum for DSA, since that behaves strangely with 256kB and might not
> work at all lower than that.

For experimentation, here's a rough patch (really two, squashed
together for now) that allows m_w_m to go down to 64kB.

drop table if exists test;
create table test (a int) with (autovacuum_enabled=false, fillfactor=10);
insert into test (a) select i from generate_series(1,2000) i;
create index on test (a);
update test set a = a + 1;

set maintenance_work_mem = '64kB';
vacuum (verbose) test;

INFO:  vacuuming "john.public.test"
INFO:  finished vacuuming "john.public.test": index scans: 3
pages: 0 removed, 91 remain, 91 scanned (100.00% of total)

The advantage with this is that we don't need to care about
MEMORY_CONTEXT_CHECKING or 32/64 bit-ness, since allocating a single
large node will immediately blow the limit, and that will happen
fairly quickly regardless. I suspect going this low will not work with
dynamic shared memory and if so would need a warning comment.
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Mon, Jul 22, 2024 at 9:26 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> + CREATE TABLE ${table1}(col1 int)
> + WITH (autovacuum_enabled=false, fillfactor=10);
> + INSERT INTO $table1 VALUES(7);
> + INSERT INTO $table1 SELECT generate_series(1, $nrows) % 3;
> + CREATE INDEX on ${table1}(col1);
> + UPDATE $table1 SET col1 = 3 WHERE col1 = 0;
> + INSERT INTO $table1 VALUES(7);
>
> These queries make sense to me; these make the radix tree wide and use
> more nodes, instead of fattening leaf nodes (i.e. the offset bitmap).
> The $table1 has 18182 blocks and the statistics of the radix tree show:
>
> max_val = 65535
> num_keys = 18182
> height = 1, n4 = 0, n16 = 1, n32 = 0, n64 = 0, n256 = 72, leaves = 18182
>
> Which means that the height of the tree is 2 and we use the maximum
> size node for all nodes except for 1 node.

Do you have some kind of tool that prints this out for you? That would
be really handy.

> I don't have any great idea to substantially reduce the total number
> of tuples in the $table1. Probably we can use DELETE instead of UPDATE
> to make garbage tuples (although I'm not sure it's okay for this
> test), which reduces the amount of WAL records from 11MB to 4MB and
> would reduce the time to catch up. But I'm not sure how much it would
> help. There might be ideas to trigger a two-round index vacuum with
> fewer tuples, but if the tests are too optimized for the current
> TidStore, we will have to re-adjust them if the TidStore changes in
> the future. So I think it's better and more reliable to allow
> maintenance_work_mem to be a lower value or use injection points
> somehow.

I think we can make improvements in overall time on master and 17 with
the examples John provided later in the thread. However, I realized
you are right about using a DELETE instead of an UPDATE. At some point
in my development, I needed the UPDATE to satisfy some other aspect of
the test. But that is no longer true.
A DELETE works just as well as an UPDATE WRT the dead items and, as you
point out, much less WAL is created and replay is much faster.

I also realized I forgot to add 043_vacuum_horizon_floor.pl to
src/test/recovery/meson.build in 16. I will post a patch here this
weekend which changes the UPDATE to a DELETE in 14-16 (which sped up
the test by about 20% for me locally) and adds
043_vacuum_horizon_floor.pl to src/test/recovery/meson.build in 16.
I'll plan to push it on Monday to save myself any weekend buildfarm
embarrassment.

As for 17 and master, I'm going to try out John's examples and see if
it seems like it will be fast enough to commit to 17/master without
lowering the maintenance_work_mem lower bound. If we do want to lower
it, I wonder if we should just halve it -- since it seems like the
tests with half the number of tuples were fast enough to avoid timing
out on slow animals on the buildfarm. Or do we need some more
meaningful value to decrease it to?

- Melanie
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Wed, Jul 24, 2024 at 8:19 AM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Wed, Jul 24, 2024 at 2:42 PM John Naylor <johncnaylorls@gmail.com> wrote:
> > As for lowering the limit, we've experimented with 256kB here:
> >
> > https://www.postgresql.org/message-id/CANWCAZZUTvZ3LsYpauYQVzcEZXZ7Qe+9ntnHgYZDTWxPuL++zA@mail.gmail.com
> >
> > As I mention there, going lower than that would need a small amount of
> > reorganization in the radix tree. Not difficult -- the thing I'm
> > concerned about is that we'd likely need to document a separate
> > minimum for DSA, since that behaves strangely with 256kB and might not
> > work at all lower than that.
>
> For experimentation, here's a rough patch (really two, squashed
> together for now) that allows m_w_m to go down to 64kB.

Oh, great, thanks! I didn't read this closely enough before I posed my
upthread question about how small we should make the minimum. It sounds
like you've thought a lot about this.

I ran my test with your patch (on my 64-bit system, non-assert build)
and the result is great:

master with my test (slightly modified to now use DELETE instead of
UPDATE as mentioned upthread)
3.09s

master with your patch applied, MWM set to 64kB and 9000 rows instead
of 800000
1.06s

> drop table if exists test;
> create table test (a int) with (autovacuum_enabled=false, fillfactor=10);
> insert into test (a) select i from generate_series(1,2000) i;
> create index on test (a);
> update test set a = a + 1;
>
> set maintenance_work_mem = '64kB';
> vacuum (verbose) test;
>
> INFO:  vacuuming "john.public.test"
> INFO:  finished vacuuming "john.public.test": index scans: 3
> pages: 0 removed, 91 remain, 91 scanned (100.00% of total)
>
> The advantage with this is that we don't need to care about
> MEMORY_CONTEXT_CHECKING or 32/64 bit-ness, since allocating a single
> large node will immediately blow the limit, and that will happen
> fairly quickly regardless.
> I suspect going this low will not work with
> dynamic shared memory and if so would need a warning comment.

I took a look at the patch, but I can't say I know enough about the
memory allocation subsystems and how TIDStore works to meaningfully
review it -- nor enough about DSM to comment on the interactions.

I suspect 256kB would also be fast enough to avoid my test timing out
on the buildfarm, but it is appealing to have a minimum for
maintenance_work_mem that is the same as work_mem.

- Melanie
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Wed, Jul 24, 2024 at 3:43 AM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Wed, Jul 24, 2024 at 5:40 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Without MEMORY_CONTEXT_CHECKING, if size is 16 bytes, required_size is
> > also 16 bytes as it's already 8-byte aligned and Bump_CHUNKHDRSZ is 0.
> > On the other hand, with MEMORY_CONTEXT_CHECKING, required_size is
> > bumped to 40 bytes as chunk_size is 24 bytes and Bump_CHUNKHDRSZ is 16
> > bytes. Therefore, with MEMORY_CONTEXT_CHECKING, we allocate more memory
> > and use more bump memory blocks, resulting in filling up the TidStore
> > in the test cases. We can easily reproduce this test failure with a
> > PostgreSQL server built without --enable-cassert. It seems that
> > copperhead is the sole BF animal that doesn't use --enable-cassert but
> > runs recovery-check.
>
> It seems we could force the bitmaps to be larger, and also reduce the
> number of updated tuples by updating only the last few tuples (say
> 5-10) by looking at the ctid's offset. This requires some trickery,
> but I believe I've done it in the past by casting to text and
> extracting with a regex. (I'm assuming the number of tuples updated is
> more important than the number of tuples inserted on a newly created
> table.)

Yes, the only thing that is important is having two rounds of index
vacuuming and having one tuple with a value matching my cursor
condition before the first index vacuum and one after. What do you
mean update only the last few tuples though?

- Melanie
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Saturday, July 27, 2024, Melanie Plageman <melanieplageman@gmail.com> wrote:
> Yes, the only thing that is important is having two rounds of index
> vacuuming and having one tuple with a value matching my cursor
> condition before the first index vacuum and one after. What do you
> mean update only the last few tuples though?
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Fri, Jul 26, 2024 at 1:27 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Mon, Jul 22, 2024 at 9:26 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > + CREATE TABLE ${table1}(col1 int)
> > + WITH (autovacuum_enabled=false, fillfactor=10);
> > + INSERT INTO $table1 VALUES(7);
> > + INSERT INTO $table1 SELECT generate_series(1, $nrows) % 3;
> > + CREATE INDEX on ${table1}(col1);
> > + UPDATE $table1 SET col1 = 3 WHERE col1 = 0;
> > + INSERT INTO $table1 VALUES(7);
> >
> > These queries make sense to me; these make the radix tree wide and use
> > more nodes, instead of fattening leaf nodes (i.e. the offset bitmap).
> > The $table1 has 18182 blocks and the statistics of the radix tree show:
> >
> > max_val = 65535
> > num_keys = 18182
> > height = 1, n4 = 0, n16 = 1, n32 = 0, n64 = 0, n256 = 72, leaves = 18182
> >
> > Which means that the height of the tree is 2 and we use the maximum
> > size node for all nodes except for 1 node.
>
> Do you have some kind of tool that prints this out for you? That would
> be really handy.

You can add '#define RT_DEBUG' for the radix tree used in TidStore and
then call RT_STATS (e.g., local_ts_stats()).

> > I don't have any great idea to substantially reduce the total number
> > of tuples in the $table1. Probably we can use DELETE instead of UPDATE
> > to make garbage tuples (although I'm not sure it's okay for this
> > test), which reduces the amount of WAL records from 11MB to 4MB and
> > would reduce the time to catch up. But I'm not sure how much it would
> > help. There might be ideas to trigger a two-round index vacuum with
> > fewer tuples, but if the tests are too optimized for the current
> > TidStore, we will have to re-adjust them if the TidStore changes in
> > the future. So I think it's better and more reliable to allow
> > maintenance_work_mem to be a lower value or use injection points
> > somehow.
> I think we can make improvements in overall time on master and 17 with
> the examples John provided later in the thread. However, I realized
> you are right about using a DELETE instead of an UPDATE. At some point
> in my development, I needed the UPDATE to satisfy some other aspect of
> the test. But that is no longer true. A DELETE works just as well as
> an UPDATE WRT the dead items and, as you point out, much less WAL is
> created and replay is much faster.
>
> I also realized I forgot to add 043_vacuum_horizon_floor.pl to
> src/test/recovery/meson.build in 16. I will post a patch here this
> weekend which changes the UPDATE to a DELETE in 14-16 (sped up the
> test by about 20% for me locally) and adds 043_vacuum_horizon_floor.pl
> to src/test/recovery/meson.build in 16. I'll plan to push it on Monday
> to save myself any weekend buildfarm embarrassment.
>
> As for 17 and master, I'm going to try out John's examples and see if
> it seems like it will be fast enough to commit to 17/master without
> lowering the maintenance_work_mem lower bound.

+1. Thanks.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> In back branches starting with 14, failing to remove tuples older than
> OldestXmin during pruning caused vacuum to infinitely loop in
> lazy_scan_prune(), as investigated on this [1] thread.

Shouldn't somebody remove the entry that we have for this issue under
https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items#Older_bugs_affecting_stable_branches?

--
Peter Geoghegan
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Wed, Jul 31, 2024 at 4:38 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > In back branches starting with 14, failing to remove tuples older than
> > OldestXmin during pruning caused vacuum to infinitely loop in
> > lazy_scan_prune(), as investigated on this [1] thread.
>
> Shouldn't somebody remove the entry that we have for this issue under
> https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items#Older_bugs_affecting_stable_branches?

Thanks for the reminder. Done!

- Melanie
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Sat, Jul 27, 2024 at 4:04 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Wed, Jul 24, 2024 at 8:19 AM John Naylor <johncnaylorls@gmail.com> wrote:
>
> I ran my test with your patch (on my 64-bit system, non-assert build)
> and the result is great:
>
> master with my test (slightly modified to now use DELETE instead of
> UPDATE as mentioned upthread)
> 3.09s
>
> master with your patch applied, MWM set to 64kB and 9000 rows instead
> of 800000
> 1.06s

Glad to hear it!

> I took a look at the patch, but I can't say I know enough about the
> memory allocation subsystems and how TIDStore works to meaningfully
> review it -- nor enough about DSM to comment about the interactions.

I tried using parallel vacuum with 64kB and it succeeded, but needed
to perform an index scan for every heap page pruned. It's not hard to
imagine some code moving around so that it doesn't work anymore, but
since this is for testing only, it seems a warning comment is enough.

> I suspect 256kB would also be fast enough to avoid my test timing out
> on the buildfarm, but it is appealing to have a minimum for
> maintenance_work_mem that is the same as work_mem.

Agreed on both counts. I came up with a simple ctid expression to make
the bitmap arrays larger:

delete from test where ctid::text like '%,2__)';

With that, it still takes between 250k and 300k tuples to force a
second index scan with 256kB m_w_m, default fillfactor, and without
asserts. (It may need a few more pages for 32-bit, but not many more.)
The table is around 1300 pages, where on v16 it's about 900. But with
fewer tuples deleted, the WAL for deletes should be lower, so it might
be comparable to v16's test.

It also turns out that to support 64kB memory settings, we actually
wouldn't need to change the radix tree to lazily create memory
contexts -- at least currently, SlabCreate doesn't allocate a keeper
block, so a newly created slab context reports 0 for "mem_allocated".
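[Editor's note: in SQL LIKE, '_' matches exactly one character and '%' any
run, so the pattern '%,2__)' above selects ctids whose offset number has
three digits starting with 2 -- offsets 200-299, i.e. tuples sitting late
on each page. A quick sketch of the same match outside SQL, with glob
'?'/'*' standing in for LIKE's '_'/'%':]

```python
import fnmatch

# Mimic: ctid::text LIKE '%,2__)' -- match offsets 200-299 only.
def matches(ctid_text):
    return fnmatch.fnmatchcase(ctid_text, '*,2??)')

print([c for c in ['(5,37)', '(5,214)', '(6,299)'] if matches(c)])
# ['(5,214)', '(6,299)']
```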
So I'm inclined to go ahead and change the minimum m_w_m on v17 and
master to 64kB. It's the quickest and (I think) most future-proof way
to make this test work. Any objections?
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Tue, Aug 6, 2024 at 9:58 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> It also turns out that to support 64kB memory settings, we actually
> wouldn't need to change the radix tree to lazily create memory
> contexts -- at least currently, SlabCreate doesn't allocate a keeper
> block, so a newly created slab context reports 0 for "mem_allocated".
> So I'm inclined to go ahead and change the minimum m_w_m on v17 and
> master to 64kB. It's the quickest and (I think) most future-proof way
> to make this test work. Any objections?

This is done. I also changed autovacuum_work_mem just for the sake of
consistency. I did some quick math and found that there shouldn't be a
difference between 32- and 64-bit platforms for when they exceed 64kB
in the tid store. That's because exceeding the limit is caused by
allocating the first block of one of the slab contexts. That
independence may not be stable, so I'm thinking of hard-coding the
block sizes in master only, but I've left that for another time.
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On 2024-08-10 16:01:18 +0700, John Naylor wrote:
> On Tue, Aug 6, 2024 at 9:58 PM John Naylor <johncnaylorls@gmail.com> wrote:
> >
> > It also turns out that to support 64kB memory settings, we actually
> > wouldn't need to change the radix tree to lazily create memory
> > contexts -- at least currently, SlabCreate doesn't allocate a keeper
> > block, so a newly created slab context reports 0 for "mem_allocated".
> > So I'm inclined to go ahead and change the minimum m_w_m on v17 and
> > master to 64kB. It's the quickest and (I think) most future-proof way
> > to make this test work. Any objections?
>
> This is done. I also changed autovacuum_work_mem just for the sake of
> consistency. I did some quick math and found that there shouldn't be a
> difference between 32- and 64-bit platforms for when they exceed 64kB
> in the tid store. That's because exceeding the limit is caused by
> allocating the first block of one of the slab contexts. That
> independence may not be stable, so I'm thinking of hard-coding the
> block sizes in master only, but I've left that for another time.

Thanks a lot!

Greetings,

Andres Freund