Thread: Intermittent buildfarm failures on wrasse
For the past five days or so, wrasse has been intermittently failing due to unexpectedly not using an Index Only Scan plan in the create_index test [1], eg

@@ -1910,11 +1910,15 @@
 SELECT unique1 FROM tenk1
 WHERE unique1 IN (1,42,7)
 ORDER BY unique1;
-                       QUERY PLAN
--------------------------------------------------------
- Index Only Scan using tenk1_unique1 on tenk1
-   Index Cond: (unique1 = ANY ('{1,42,7}'::integer[]))
-(2 rows)
+                            QUERY PLAN
+-------------------------------------------------------------------
+ Sort
+   Sort Key: unique1
+   ->  Bitmap Heap Scan on tenk1
+         Recheck Cond: (unique1 = ANY ('{1,42,7}'::integer[]))
+         ->  Bitmap Index Scan on tenk1_unique1
+               Index Cond: (unique1 = ANY ('{1,42,7}'::integer[]))
+(6 rows)

 SELECT unique1 FROM tenk1
 WHERE unique1 IN (1,42,7)

The most probable explanation for this seems to be that tenk1's pg_class.relallvisible value hasn't been set high enough to make an IOS look cheaper than the alternatives. Where that ought to be getting set is the "VACUUM ANALYZE tenk1" step in test_setup.sql. It's plausible I guess that a background autovacuum is preventing that command from setting relallvisible as high as it ought to be --- but if so, why are we only seeing two plans changing, on only one animal?

But what I'm really confused about is that this test arrangement has been stable since early February. Why has wrasse suddenly started showing a 25% failure rate when it never failed this way before that? Somebody has to have recently committed a change that affects this. Checking the commit log up to the onset of the failures on 8 April, I only see two plausible candidates:

* shared-memory pgstats

* Peter's recent VACUUM changes

Any connection to pgstats is, um, pretty obscure. I'd finger the VACUUM changes as a more likely trigger except that the last interesting-looking one was f3c15cbe5 on 3 April, and wrasse got through "make check" 38 times after that before its first failure of this kind. That doesn't square with the 25% failure rate since then, so I'm kind of forced to the conclusion that the pgstats work changed some behavior that it should not have.

Any ideas? I'm tempted to add something like

SELECT relallvisible = relpages FROM pg_class WHERE relname = 'tenk1';

so that we can confirm or refute the theory that relallvisible is the driving factor.

regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2022-04-08%2003%3A48%3A30
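For reference, a slightly expanded form of that probe (purely illustrative; the message above only proposes the boolean form) would report the raw pg_class numbers that drive the index-only-scan costing:

    -- Illustrative diagnostic: show the stats the planner uses for IOS costing.
    SELECT relpages, relallvisible, reltuples,
           relallvisible = relpages AS fully_visible
    FROM pg_class
    WHERE relname = 'tenk1';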
On Wed, Apr 13, 2022 at 3:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I'm tempted to add something like > > SELECT relallvisible = relpages FROM pg_class WHERE relname = 'tenk1'; > > so that we can confirm or refute the theory that relallvisible is > the driving factor. It would be fairly straightforward to commit a temporary debugging patch that has the autovacuum logging stuff report directly on how VACUUM set new_rel_allvisible in pg_class. We should probably be doing that already, just because it's useful information that is already close at hand. Might be a bit trickier to make sure that wrasse reliably reported on all relevant VACUUMs, since that would have to include manual VACUUMs (which would really have to use VACUUM VERBOSE), as well as autovacuums. -- Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes: > On Wed, Apr 13, 2022 at 3:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I'm tempted to add something like >> SELECT relallvisible = relpages FROM pg_class WHERE relname = 'tenk1'; >> so that we can confirm or refute the theory that relallvisible is >> the driving factor. > It would be fairly straightforward to commit a temporary debugging > patch that has the autovacuum logging stuff report directly on how > VACUUM set new_rel_allvisible in pg_class. We should probably be doing > that already, just because it's useful information that is already > close at hand. Doesn't look like wrasse has autovacuum logging enabled, though. After a bit more navel-contemplation I see a way that the pgstats work could have changed timing in this area. We used to have a rate limit on how often stats reports would be sent to the collector, which'd ensure half a second or so delay before a transaction's change counts became visible to the autovac daemon. I've not looked at the new code, but I'm betting that that's gone and the autovac launcher might start a worker nearly immediately after some foreground process finishes inserting some rows. So that could result in autovac activity occurring concurrently with test_setup where it didn't before. As to what to do about it ... maybe apply the FREEZE and DISABLE_PAGE_SKIPPING options in test_setup's vacuums? It seems like DISABLE_PAGE_SKIPPING is necessary but perhaps not sufficient. regards, tom lane
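Spelled out, that suggestion amounts to changing test_setup.sql's existing command to something along these lines (a sketch of the idea only, not a committed change):

    -- Hypothetical variant of test_setup.sql's existing "VACUUM ANALYZE tenk1;"
    VACUUM (ANALYZE, FREEZE, DISABLE_PAGE_SKIPPING) tenk1;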
On Thu, 14 Apr 2022 at 10:54, Tom Lane <tgl@sss.pgh.pa.us> wrote: > After a bit more navel-contemplation I see a way that the pgstats > work could have changed timing in this area. We used to have a > rate limit on how often stats reports would be sent to the > collector, which'd ensure half a second or so delay before a > transaction's change counts became visible to the autovac daemon. > I've not looked at the new code, but I'm betting that that's gone > and the autovac launcher might start a worker nearly immediately > after some foreground process finishes inserting some rows. > So that could result in autovac activity occurring concurrently > with test_setup where it didn't before. It's not quite clear to me why the manual vacuum wouldn't just cancel the autovacuum and complete the job. I can't quite see how there's room for competing page locks here. Also, see [1]. One of the reported failing tests there is the same as one of the failing tests on wrasse. My investigation for the AIO branch found that relallvisible was not equal to relpages. I don't recall the reason why that was happening now. > As to what to do about it ... maybe apply the FREEZE and > DISABLE_PAGE_SKIPPING options in test_setup's vacuums? > It seems like DISABLE_PAGE_SKIPPING is necessary but perhaps > not sufficient. We should likely try and confirm it's due to relallvisible first. David [1] https://www.postgresql.org/message-id/20220224153339.pqn64kseb5gpgl74@alap3.anarazel.de
On Wed, Apr 13, 2022 at 3:54 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > After a bit more navel-contemplation I see a way that the pgstats > work could have changed timing in this area. We used to have a > rate limit on how often stats reports would be sent to the > collector, which'd ensure half a second or so delay before a > transaction's change counts became visible to the autovac daemon. > I've not looked at the new code, but I'm betting that that's gone > and the autovac launcher might start a worker nearly immediately > after some foreground process finishes inserting some rows. > So that could result in autovac activity occurring concurrently > with test_setup where it didn't before. But why should it matter? The test_setup.sql VACUUM of tenk1 should leave relallvisible and relpages in the same state, either way (or very close to it). The only way that it seems like it could matter is if OldestXmin was held back during test_setup.sql's execution of the VACUUM command. > As to what to do about it ... maybe apply the FREEZE and > DISABLE_PAGE_SKIPPING options in test_setup's vacuums? > It seems like DISABLE_PAGE_SKIPPING is necessary but perhaps > not sufficient. BTW, the work on VACUUM for Postgres 15 probably makes VACUUM test flappiness issues less of a problem -- unless they're issues involving something holding back OldestXmin when it shouldn't (in which case it won't have any effect on test stability). I would expect that to be the case, at least, since VACUUM now does almost all of the same work for any individual page that it cannot get a cleanup lock on. There is surprisingly little difference between a page that gets processed by lazy_scan_prune and a page that gets processed by lazy_scan_noprune. -- Peter Geoghegan
Hi,

On April 13, 2022 7:06:33 PM EDT, David Rowley <dgrowleyml@gmail.com> wrote:
>On Thu, 14 Apr 2022 at 10:54, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> After a bit more navel-contemplation I see a way that the pgstats
>> work could have changed timing in this area. We used to have a
>> rate limit on how often stats reports would be sent to the
>> collector, which'd ensure half a second or so delay before a
>> transaction's change counts became visible to the autovac daemon.
>> I've not looked at the new code, but I'm betting that that's gone
>> and the autovac launcher might start a worker nearly immediately
>> after some foreground process finishes inserting some rows.
>> So that could result in autovac activity occurring concurrently
>> with test_setup where it didn't before.
>
>It's not quite clear to me why the manual vacuum wouldn't just cancel
>the autovacuum and complete the job. I can't quite see how there's
>room for competing page locks here. Also, see [1]. One of the
>reported failing tests there is the same as one of the failing tests
>on wrasse. My investigation for the AIO branch found that
>relallvisible was not equal to relpages. I don't recall the reason why
>that was happening now.
>
>> As to what to do about it ... maybe apply the FREEZE and
>> DISABLE_PAGE_SKIPPING options in test_setup's vacuums?
>> It seems like DISABLE_PAGE_SKIPPING is necessary but perhaps
>> not sufficient.
>
>We should likely try and confirm it's due to relallvisible first.

We had this issue before, and not just on the aio branch. On my phone right now, so won't look up references.

IIRC the problem in matter isn't skipped pages, but that the horizon simply isn't new enough to mark pages as all visible. An independent autovac worker starting is enough for that, for example. Previously the data load and vacuum were further apart, preventing this kind of issue.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Wed, Apr 13, 2022 at 4:13 PM Andres Freund <andres@anarazel.de> wrote:
> IIRC the problem in matter isn't skipped pages, but that the horizon simply isn't new enough to mark pages as all visible.

Sometimes OldestXmin can go backwards in VACUUM operations that are run in close succession against the same table, due to activity from other databases in the same cluster (perhaps other factors are involved at times). That's why the following assertion that I recently added to vacuumlazy.c will fail pretty quickly without the "vacrel->NewRelfrozenXid == OldestXmin" part of its test:

    Assert(vacrel->NewRelfrozenXid == OldestXmin ||
           TransactionIdPrecedesOrEquals(aggressive ? FreezeLimit :
                                         vacrel->relfrozenxid,
                                         vacrel->NewRelfrozenXid));

If you remove "vacrel->NewRelfrozenXid == OldestXmin", and run the regression tests, the remaining assertion will fail quite easily. Though perhaps not with a serial "make check".

--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes: > On Wed, Apr 13, 2022 at 4:13 PM Andres Freund <andres@anarazel.de> wrote: >> IIRC the problem in matter isn't skipped pages, but that the horizon simply isn't new enough to mark pages as all visible. > Sometimes OldestXmin can go backwards in VACUUM operations that are > run in close succession against the same table, due to activity from > other databases in the same cluster (perhaps other factors are > involved at times). I've been doing some testing locally by inserting commands to manually set tenk1's relallvisible to zero. I first did that in test_setup.sql ... and it had no effect whatsoever. Further experimentation showed that the "CREATE INDEX ON tenk1" steps in create_index.sql itself generally suffice to fix relallvisible; although if you force it back to zero after the last such command, you get the same plan diffs wrasse is showing. And you don't get any others, which I thought curious until I realized that sanity_check.sql's database-wide VACUUM offers yet another opportunity to heal the incorrect value. If you force it back to zero again after that, a bunch of later tests start to show plan differences, which is what I'd been expecting. So what seems to be happening on wrasse is that a background autovacuum (or really autoanalyze?) is preventing pages from being marked all-visible not only during test_setup.sql but also create_index.sql; but it's gone by the time sanity_check.sql runs. Which is odd in itself because not that much time elapses between create_index and sanity_check, certainly less than the time from test_setup to create_index. It seems like a reliable fix might require test_setup to wait for any background autovac to exit before it does its own vacuums. Ick. And we still lack an explanation of why this only now broke. I remain suspicious that pgstats is behaving unexpectedly. regards, tom lane
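The manipulation described above is roughly the following (superuser-only, and purely for experimentation; the EXPLAIN is just to observe which plan comes out):

    -- Pretend no pages of tenk1 are all-visible, then see which plan is chosen.
    UPDATE pg_class SET relallvisible = 0 WHERE relname = 'tenk1';
    EXPLAIN (COSTS OFF)
      SELECT unique1 FROM tenk1 WHERE unique1 IN (1, 42, 7) ORDER BY unique1;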
On Wed, Apr 13, 2022 at 4:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So what seems to be happening on wrasse is that a background
> autovacuum (or really autoanalyze?) is preventing pages from
> being marked all-visible not only during test_setup.sql but
> also create_index.sql; but it's gone by the time sanity_check.sql
> runs.

I agree that it would need to be an autoanalyze (due to the PROC_IN_VACUUM optimization).

> It seems like a reliable fix might require test_setup to wait
> for any background autovac to exit before it does its own
> vacuums. Ick.

This is hardly a new problem, really. I wonder if it's worth inventing a comprehensive solution. Some kind of infrastructure that makes VACUUM establish a next XID up-front (by calling ReadNextTransactionId()), and then find a way to run with an OldestXmin that's >= the earlier "next" XID value. If necessary by waiting.

--
Peter Geoghegan
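A very rough sketch of what that could look like inside VACUUM (no such code exists; the function choices, the retry interval, and "rel" standing for the relation being vacuumed are all guesses made for illustration):

    /* Hypothetical wait-for-horizon loop, sketched for illustration only */
    TransactionId target_xid = ReadNextTransactionId();

    for (;;)
    {
        TransactionId oldest_xmin = GetOldestNonRemovableTransactionId(rel);

        if (TransactionIdFollowsOrEquals(oldest_xmin, target_xid))
            break;              /* horizon has caught up; safe to proceed */
        pg_usleep(100 * 1000L); /* retry every 100ms */
    }

Whether such waiting belongs in vacuumlazy.c itself or somewhere higher up is exactly what the rest of the thread goes on to discuss.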
Peter Geoghegan <pg@bowt.ie> writes: > On Wed, Apr 13, 2022 at 4:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> It seems like a reliable fix might require test_setup to wait >> for any background autovac to exit before it does its own >> vacuums. Ick. > This is hardly a new problem, really. I wonder if it's worth inventing > a comprehensive solution. Yeah, we have band-aided around this type of problem repeatedly. Making a fix that's readily accessible from any test script seems like a good idea. regards, tom lane
On Wed, Apr 13, 2022 at 4:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Yeah, we have band-aided around this type of problem repeatedly. > Making a fix that's readily accessible from any test script > seems like a good idea. We might even be able to consistently rely on this new option, given *any* problem involving test stability and VACUUM. Having a one-size-fits-all solution to these kinds of stability problems would be nice -- no more DISABLE_PAGE_SKIPPING bandaids. That general approach will be possible provided an inability to acquire a cleanup lock during VACUUM (which can more or less occur at random in most environments) doesn't ever lead to unexpected test results. There is good reason to think that it might work out that way. Simulating problems with acquiring cleanup locks during VACUUM left me with the impression that that could really work: https://postgr.es/m/CAH2-WzkiB-qcsBmWrpzP0nxvrQExoUts1d7TYShg_DrkOHeg4Q@mail.gmail.com -- Peter Geoghegan
Hi,

On 2022-04-13 16:45:44 -0700, Peter Geoghegan wrote:
> On Wed, Apr 13, 2022 at 4:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > So what seems to be happening on wrasse is that a background
> > autovacuum (or really autoanalyze?) is preventing pages from
> > being marked all-visible not only during test_setup.sql but
> > also create_index.sql; but it's gone by the time sanity_check.sql
> > runs.
>
> I agree that it would need to be an autoanalyze (due to the
> PROC_IN_VACUUM optimization).

That's not a reliable protection - the snapshot is established normally at first; only after a while do we set PROC_IN_VACUUM...

Greetings,

Andres Freund
Peter Geoghegan <pg@bowt.ie> writes: > On Wed, Apr 13, 2022 at 4:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Yeah, we have band-aided around this type of problem repeatedly. >> Making a fix that's readily accessible from any test script >> seems like a good idea. > We might even be able to consistently rely on this new option, given > *any* problem involving test stability and VACUUM. Having a > one-size-fits-all solution to these kinds of stability problems would > be nice -- no more DISABLE_PAGE_SKIPPING bandaids. My guess is that you'd need both this new wait-for-horizon behavior *and* DISABLE_PAGE_SKIPPING. But the two together ought to make for pretty reproducible behavior. I noticed while scanning the commit log that some patches have tried adding a FREEZE option, which seems more like waving a dead chicken than anything that'd improve stability. We'd not necessarily have to embed wait-for-horizon into VACUUM itself. It seems like a SQL-accessible function could be written and then called before any problematic VACUUM. I like this better for something we're thinking of jamming in post-feature-freeze; we'd not be committing to the feature quite as much as if we added a VACUUM option. regards, tom lane
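From a test script's point of view that might end up looking something like this (the function name here is invented; nothing like it exists at this point):

    -- Hypothetical helper; waits until the removable-XID horizon has caught up.
    SELECT wait_for_removable_cutoff();
    VACUUM (ANALYZE, DISABLE_PAGE_SKIPPING) tenk1;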
Hi, On 2022-04-13 18:54:06 -0400, Tom Lane wrote: > We used to have a rate limit on how often stats reports would be sent > to the collector, which'd ensure half a second or so delay before a > transaction's change counts became visible to the autovac daemon. Just for posterity: That's not actually gone. But what is gone is the rate limiting in autovacuum about requesting recent stats for a table / autovac seeing slightly older stats. Greetings, Andres Freund
On Wed, Apr 13, 2022 at 5:35 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > My guess is that you'd need both this new wait-for-horizon behavior > *and* DISABLE_PAGE_SKIPPING. But the two together ought to make > for pretty reproducible behavior. I noticed while scanning the > commit log that some patches have tried adding a FREEZE option, > which seems more like waving a dead chicken than anything that'd > improve stability. I think that it's more likely that FREEZE will correct problems, out of the two: * FREEZE forces an aggressive VACUUM whose FreezeLimit is as recent a cutoff value as possible (FreezeLimit will be equal to OldestXmin). * DISABLE_PAGE_SKIPPING also forces an aggressive VACUUM. But unlike FREEZE it makes VACUUM not use the visibility map, even in the case of all-frozen pages. And it changes nothing about FreezeLimit. It's also a certainty that VACUUM(FREEZE, DISABLE_PAGE_SKIPPING) is not a sensible remedy for any problem with test stability, but there are still some examples of that combination in the regression tests. The only way it could make sense is if the visibility map might be corrupt, but surely we're not expecting that anyway (and if we were we'd be testing it more directly). I recently argued that DISABLE_PAGE_SKIPPING should have nothing to do with aggressive vacuuming -- that should all be left up to VACUUM FREEZE. It seems more logical to make DISABLE_PAGE_SKIPPING mean "don't use the visibility map to skip anything", without bringing aggressiveness into it at all. That would be less confusing. > We'd not necessarily have to embed wait-for-horizon into VACUUM > itself. It seems like a SQL-accessible function could be written > and then called before any problematic VACUUM. I like this better > for something we're thinking of jamming in post-feature-freeze; > we'd not be committing to the feature quite as much as if we > added a VACUUM option. Hmm. I would say that the feature has zero appeal to users anyway. Maybe it can and should be done through an SQL function for other reasons, though. Users already think that there are several different flavors of VACUUM, which isn't really true. -- Peter Geoghegan
Hi, On 2022-04-13 20:35:50 -0400, Tom Lane wrote: > Peter Geoghegan <pg@bowt.ie> writes: > > On Wed, Apr 13, 2022 at 4:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> Yeah, we have band-aided around this type of problem repeatedly. > >> Making a fix that's readily accessible from any test script > >> seems like a good idea. > > > We might even be able to consistently rely on this new option, given > > *any* problem involving test stability and VACUUM. Having a > > one-size-fits-all solution to these kinds of stability problems would > > be nice -- no more DISABLE_PAGE_SKIPPING bandaids. > > My guess is that you'd need both this new wait-for-horizon behavior > *and* DISABLE_PAGE_SKIPPING. But the two together ought to make > for pretty reproducible behavior. I noticed while scanning the > commit log that some patches have tried adding a FREEZE option, > which seems more like waving a dead chicken than anything that'd > improve stability. I think most of those we've ended up replacing by using temp tables in those tests instead, since they're not affected by the global horizon anymore. > We'd not necessarily have to embed wait-for-horizon into VACUUM > itself. I'm not sure it'd be quite reliable outside of vacuum though, due to the horizon potentially going backwards (in otherwise harmless ways)? > It seems like a SQL-accessible function could be written > and then called before any problematic VACUUM. I like this better > for something we're thinking of jamming in post-feature-freeze; > we'd not be committing to the feature quite as much as if we > added a VACUUM option. We could otherwise just disable IOS for that query, for now. Greetings, Andres Freund
On Wed, Apr 13, 2022 at 6:03 PM Peter Geoghegan <pg@bowt.ie> wrote: > I think that it's more likely that FREEZE will correct problems, out of the two: > > * FREEZE forces an aggressive VACUUM whose FreezeLimit is as recent a > cutoff value as possible (FreezeLimit will be equal to OldestXmin). The reason why that might have helped (at least in the past) is that it's enough to force us to wait for a cleanup lock to prune and freeze, if necessary. Which was never something that DISABLE_PAGE_SKIPPING could do. -- Peter Geoghegan
On Wed, Apr 13, 2022 at 6:05 PM Andres Freund <andres@anarazel.de> wrote: > I think most of those we've ended up replacing by using temp tables in > those tests instead, since they're not affected by the global horizon > anymore. Maybe, but it's a pain to have to work that way. You can't do it in cases like this, because a temp table is not workable. So that's not an ideal long term solution. > > We'd not necessarily have to embed wait-for-horizon into VACUUM > > itself. > > I'm not sure it'd be quite reliable outside of vacuum though, due to the > horizon potentially going backwards (in otherwise harmless ways)? I agree, since vacuumlazy.c would need to either be given its own OldestXmin, or knowledge of a wait-up-to XID. Either way we have to make non-trivial changes to vacuumlazy.c. -- Peter Geoghegan
Andres Freund <andres@anarazel.de> writes:
> On 2022-04-13 20:35:50 -0400, Tom Lane wrote:
>> It seems like a SQL-accessible function could be written
>> and then called before any problematic VACUUM. I like this better
>> for something we're thinking of jamming in post-feature-freeze;
>> we'd not be committing to the feature quite as much as if we
>> added a VACUUM option.

> We could otherwise just disable IOS for that query, for now.

The entire point of that test case is to verify the shape of the IOS plan, so no, that's not an acceptable answer. But if we're looking for quick hacks, we could do

update pg_class set relallvisible = relpages where relname = 'tenk1';

just before that test.

I'm still suspicious of the pgstat changes, though. I checked into things here by doing

initdb
edit postgresql.conf to set log_autovacuum_min_duration = 0
pg_ctl start && make installcheck-parallel

and what I see is that the first reported autovacuum activity begins exactly one minute after the postmaster starts, which is what I'd expect given the autovacuum naptime rules. On my machine, of course, the installcheck-parallel run is long gone by then. But even on the much slower wrasse, we should be well past create_index by the time any autovac worker launches --- you can see from wrasse's reported test runtimes that only about 10 seconds have elapsed when it gets to the end of create_index.

This suggests to me that whatever is holding the (presumed) conflicting snapshot must be the autovac launcher, because what else could it be? So I'm suspicious that we broke something in that area, though I am baffled why only wrasse would be telling us so.

regards, tom lane
Hi, Noah, any chance you could enable log_autovacuum_min_duration=0 on wrasse? On 2022-04-13 21:23:12 -0400, Tom Lane wrote: > I'm still suspicious of the pgstat changes, though. I checked into > things here by doing > > initdb > edit postgresql.conf to set log_autovacuum_min_duration = 0 > pg_ctl start && make installcheck-parallel > > and what I see is that the first reported autovacuum activity begins > exactly one minute after the postmaster starts, which is what I'd > expect given the autovacuum naptime rules. It'd not necessarily have to be autovacuum. A CREATE INDEX or VACUUM using parallelism, could also cause this, I think. It'd be a narrow window, of course... Does sparc have wider alignment rules for some types? Perhaps that'd be enough to put some tables to be sufficiently larger to trigger parallel vacuum? Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes:
> Noah, any chance you could enable log_autovacuum_min_duration=0 on
> wrasse?

+1

> Does sparc have wider alignment rules for some types? Perhaps that'd be
> enough to put some tables to be sufficiently larger to trigger parallel
> vacuum?

No, the configure results on wrasse look pretty ordinary:

checking size of void *... 8
checking size of size_t... 8
checking size of long... 8
checking alignment of short... 2
checking alignment of int... 4
checking alignment of long... 8
checking alignment of double... 8

I wondered for a moment about force_parallel_mode, but wrasse doesn't appear to be setting that, and in any case I'm pretty sure it only affects plannable statements.

regards, tom lane
On Wed, Apr 13, 2022 at 06:51:12PM -0700, Andres Freund wrote: > Noah, any chance you could enable log_autovacuum_min_duration=0 on > wrasse? Done. Also forced hourly builds.
Noah Misch <noah@leadboat.com> writes: > On Wed, Apr 13, 2022 at 06:51:12PM -0700, Andres Freund wrote: >> Noah, any chance you could enable log_autovacuum_min_duration=0 on >> wrasse? > Done. Also forced hourly builds. Thanks! We now have two failing runs with the additional info [1][2], and in both, it's clear that the first autovac worker doesn't launch until 1 minute after postmaster start, by which time we're long done with the test scripts of interest. So whatever is breaking this is not an autovac worker. I think I'm going to temporarily add a couple of queries to check what tenk1's relallvisible actually is, just so we can confirm positively that that's what's causing the plan change. (I'm also curious about whether the CREATE INDEX steps manage to change it at all.) regards, tom lane [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2022-04-14%2013%3A28%3A14 [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2022-04-14%2004%3A48%3A13
Hi, On 2022-04-14 12:01:23 -0400, Tom Lane wrote: > Noah Misch <noah@leadboat.com> writes: > > On Wed, Apr 13, 2022 at 06:51:12PM -0700, Andres Freund wrote: > >> Noah, any chance you could enable log_autovacuum_min_duration=0 on > >> wrasse? > > > Done. Also forced hourly builds. Thanks! Can you repro the problem manually on wrasse, perhaps even outside the buildfarm script? That might be simpler than debugging via the BF... > Thanks! We now have two failing runs with the additional info [1][2], > and in both, it's clear that the first autovac worker doesn't launch > until 1 minute after postmaster start, by which time we're long done > with the test scripts of interest. So whatever is breaking this is > not an autovac worker. I did some experiments around that too, and didn't find any related problems. For a second I was wondering if it's caused by the time of initdb (which ends up with a working pgstat snapshot now, but didn't before), but that's just a few more seconds. While the BF scripts don't show timestamps for initdb, the previous step's log output confirms that it's just a few seconds... > I think I'm going to temporarily add a couple of queries to check > what tenk1's relallvisible actually is, just so we can confirm > positively that that's what's causing the plan change. (I'm also > curious about whether the CREATE INDEX steps manage to change it > at all.) I wonder if we should make VACUUM log the VERBOSE output at DEBUG1 unconditionally. This is like the third bug where we needed that information, and it's practically impossible to include in regression output. Then we'd know what the xid horizon is, whether pages were skipped, etc. It also just generally seems like a good thing. Greetings, Andres Freund
On Thu, Apr 14, 2022 at 9:18 AM Andres Freund <andres@anarazel.de> wrote: > I wonder if we should make VACUUM log the VERBOSE output at DEBUG1 > unconditionally. This is like the third bug where we needed that > information, and it's practically impossible to include in regression > output. Then we'd know what the xid horizon is, whether pages were > skipped, etc. I like the idea of making VACUUM log the VERBOSE output as a configurable user-visible feature. We'll then be able to log all VACUUM statements (not just autovacuum worker VACUUMs). -- Peter Geoghegan
Andres Freund <andres@anarazel.de> writes: > Thanks! Can you repro the problem manually on wrasse, perhaps even > outside the buildfarm script? I'm working on that right now, actually... > I wonder if we should make VACUUM log the VERBOSE output at DEBUG1 > unconditionally. This is like the third bug where we needed that > information, and it's practically impossible to include in regression > output. Then we'd know what the xid horizon is, whether pages were > skipped, etc. Right at the moment it seems like we also need visibility into what CREATE INDEX is doing. I'm not sure I'd buy into permanent changes here (at least not ones made in haste), but temporarily adding more logging seems perfectly reasonable. regards, tom lane
Hi, On 2022-04-14 12:26:20 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > Thanks! Can you repro the problem manually on wrasse, perhaps even > > outside the buildfarm script? Ah, cool. > I'm working on that right now, actually... > > > I wonder if we should make VACUUM log the VERBOSE output at DEBUG1 > > unconditionally. This is like the third bug where we needed that > > information, and it's practically impossible to include in regression > > output. Then we'd know what the xid horizon is, whether pages were > > skipped, etc. > > Right at the moment it seems like we also need visibility into what > CREATE INDEX is doing. > I'm not sure I'd buy into permanent changes here (at least not ones made > in haste), but temporarily adding more logging seems perfectly reasonable. I think it might be worth leaving in, but let's debate that separately? I'm thinking of something like the attached. Greetings, Andres Freund
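The attached patch itself isn't reproduced in the archive text. The general shape of the idea - emit the per-table VACUUM report at a DEBUG level when neither VERBOSE nor the autovacuum log threshold asked for it - would be something like the following sketch; this is a guess at the approach, not the actual patch:

    /* Sketch only: pick an elevel so the report is always emitted somewhere */
    if (verbose)
        elevel = INFO;
    else if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
        elevel = LOG;
    else
        elevel = DEBUG1;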
On Thu, Apr 14, 2022 at 9:48 AM Andres Freund <andres@anarazel.de> wrote:
> I think it might be worth leaving in, but let's debate that separately?
> I'm thinking of something like the attached.

The current convention for the "extra" ereport()s that VACUUM VERBOSE outputs at INFO elevel is to use DEBUG2 elevel in all other cases (these extra messages are considered part of VACUUM VERBOSE output, but are *not* considered part of the autovacuum log output).

It looks like you're changing the elevel convention for these "extra" messages with this patch. That might be fine, but don't forget about similar ereports() in vacuumparallel.c. I think that the elevel should probably remain uniform across all of these messages. Though I don't particularly care if it's DEBUG2 or DEBUG5.

--
Peter Geoghegan
On Thu, Apr 14, 2022 at 10:07 AM Peter Geoghegan <pg@bowt.ie> wrote:
> It looks like you're changing the elevel convention for these "extra"
> messages with this patch. That might be fine, but don't forget about
> similar ereports() in vacuumparallel.c. I think that the elevel should
> probably remain uniform across all of these messages. Though I don't
> particularly care if it's DEBUG2 or DEBUG5.

Also, don't forget to do something here, with the assertion and with the message:

    if (verbose)
    {
        /*
         * Aggressiveness already reported earlier, in dedicated
         * VACUUM VERBOSE ereport
         */
        Assert(!params->is_wraparound);
        msgfmt = _("finished vacuuming \"%s.%s.%s\": index scans: %d\n");
    }
    else if (params->is_wraparound)
    {

Presumably we will need to report on antiwraparound-ness in the verbose-debug-elevel-for-autovacuum case (and not allow this assertion to fail).

--
Peter Geoghegan
I wrote:
> Andres Freund <andres@anarazel.de> writes:
>> Thanks! Can you repro the problem manually on wrasse, perhaps even
>> outside the buildfarm script?

> I'm working on that right now, actually...

So far, reproducing it manually has been a miserable failure: I've run about 180 cycles of the core regression tests with no error. Not sure what's different between my test scenario and wrasse's.

Meanwhile, wrasse did fail with my relallvisible check in place [1], and what that shows is that relallvisible is *zero* to start with and remains so throughout the CREATE INDEX sequence. That pretty definitively proves that it's not a page-skipping problem but an xmin-horizon-too-old problem. We're no closer to understanding where that horizon value is coming from, though.

regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2022-04-14%2019%3A28%3A12
On Thu, Apr 14, 2022 at 2:33 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Meanwhile, wrasse did fail with my relallvisible check in place [1], > and what that shows is that relallvisible is *zero* to start with > and remains so throughout the CREATE INDEX sequence. That pretty > definitively proves that it's not a page-skipping problem but > an xmin-horizon-too-old problem. We're no closer to understanding > where that horizon value is coming from, though. Have you looked at the autovacuum log output in more detail? It might be possible to debug further, but looks like there are no XIDs to work off of in the log_line_prefix that's in use on wrasse. The CITester log_line_prefix is pretty useful -- I wonder if we can standardize on that within the buildfarm, too. -- Peter Geoghegan
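For what it's worth, getting XIDs into the prefix is just a matter of adding %x; the following is one possible setting, not a claim about what CITester actually uses, and %x prints 0 until a transaction ID has actually been assigned:

    -- Hypothetical prefix with the backend's XID included
    ALTER SYSTEM SET log_line_prefix = '%m [%p:%l] %x %q%a ';
    SELECT pg_reload_conf();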
Peter Geoghegan <pg@bowt.ie> writes:
> Have you looked at the autovacuum log output in more detail?

I don't think there's anything to be learned there. The first autovacuum in wrasse's log happens long after things went south:

2022-04-14 22:49:15.177 CEST [9427:1] LOG:  automatic vacuum of table "regression.pg_catalog.pg_type": index scans: 1
        pages: 0 removed, 49 remain, 49 scanned (100.00% of total)
        tuples: 539 removed, 1112 remain, 0 are dead but not yet removable
        removable cutoff: 8915, older by 1 xids when operation ended
        index scan needed: 34 pages from table (69.39% of total) had 1107 dead item identifiers removed
        index "pg_type_oid_index": pages: 14 in total, 0 newly deleted, 0 currently deleted, 0 reusable
        index "pg_type_typname_nsp_index": pages: 13 in total, 0 newly deleted, 0 currently deleted, 0 reusable
        avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
        buffer usage: 193 hits, 0 misses, 0 dirtied
        WAL usage: 116 records, 0 full page images, 14113 bytes
        system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s

If we captured equivalent output from the manual VACUUM in test_setup, maybe something could be learned. However, it seems virtually certain to me that the problematic xmin is in some background process (eg autovac launcher) and thus wouldn't show up in the postmaster log, log_line_prefix or no.

regards, tom lane
On Thu, Apr 14, 2022 at 3:23 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > If we captured equivalent output from the manual VACUUM in test_setup, > maybe something could be learned. However, it seems virtually certain > to me that the problematic xmin is in some background process > (eg autovac launcher) and thus wouldn't show up in the postmaster log, > log_line_prefix or no. A bunch of autovacuums that ran between "2022-04-14 22:49:16.274" and "2022-04-14 22:49:19.088" all have the same "removable cutoff". The logs from this time show a period of around three seconds (likely more) where something held back OldestXmin generally. That does seem a bit fishy to me, even though it happened about a minute after the failure itself took place. -- Peter Geoghegan
On Thu, Apr 14, 2022 at 3:28 PM Peter Geoghegan <pg@bowt.ie> wrote: > A bunch of autovacuums that ran between "2022-04-14 22:49:16.274" and > "2022-04-14 22:49:19.088" all have the same "removable cutoff". Are you aware of Andres' commit 02fea8fd? That work prevented exactly the same set of symptoms (the same index-only scan create_index regressions), which was apparently necessary following the rearrangements to the regression tests to remove cross-script dependencies (Tom's commit cc50080a82). This was the thread that led to Andres' commit, which was just over a month ago: https://www.postgresql.org/message-id/flat/CAJ7c6TPJNof1Q+vJsy3QebgbPgXdu2ErPvYkBdhD6_Ckv5EZRg@mail.gmail.com -- Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes: > Are you aware of Andres' commit 02fea8fd? That work prevented exactly > the same set of symptoms (the same index-only scan create_index > regressions), Hm. I'm starting to get the feeling that the real problem here is we've "optimized" the system to the point where repeatable results from VACUUM are impossible :-( regards, tom lane
On Thu, Apr 14, 2022 at 6:32 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Hm. I'm starting to get the feeling that the real problem here is > we've "optimized" the system to the point where repeatable results > from VACUUM are impossible :-( I don't think that there is any fundamental reason why VACUUM cannot have repeatable results. Anyway, I suppose it's possible that problems reappeared here due to some other patch. Something else could have broken Andres' earlier band aid solution (which was to set synchronous_commit=on in test_setup). Is there any patch that could plausibly have had that effect, whose commit fits with our timeline for the problems seen on wrasse? -- Peter Geoghegan
Hi,

On 2022-04-14 21:32:27 -0400, Tom Lane wrote:
> Peter Geoghegan <pg@bowt.ie> writes:
> > Are you aware of Andres' commit 02fea8fd? That work prevented exactly
> > the same set of symptoms (the same index-only scan create_index
> > regressions),
>
> Hm. I'm starting to get the feeling that the real problem here is
> we've "optimized" the system to the point where repeatable results
> from VACUUM are impossible :-(

The synchronous_commit issue is an old one. It might actually be worth addressing it by flushing out pending async commits instead. It just started to be noticeable when the tenk1 load and vacuum were moved closer together.

What do you think about applying a polished version of what I posted in https://postgr.es/m/20220414164830.63rk5zqsvtqqk7qz%40alap3.anarazel.de ? That'd tell us a bit more about the horizon etc.

It doesn't have to be the autovacuum launcher. I think it shouldn't even be taken into account - it's not database-connected, and tenk1 isn't a shared relation. All very odd.

It's also interesting that it only happens in the installcheck cases, afaics, not the check ones. Although that might just be because there's more of them...

Greetings,

Andres Freund
Peter Geoghegan <pg@bowt.ie> writes: > Anyway, I suppose it's possible that problems reappeared here due to > some other patch. Something else could have broken Andres' earlier > band aid solution (which was to set synchronous_commit=on in > test_setup). That band-aid only addressed the situation of someone having turned off synchronous_commit in the first place; which is not the case on wrasse or most/all other buildfarm animals. Whatever we're dealing with here is something independent of that. > Is there any patch that could plausibly have had that effect, whose > commit fits with our timeline for the problems seen on wrasse? I already enumerated my suspects, back at the top of this thread. regards, tom lane
On Thu, Apr 14, 2022 at 6:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > That band-aid only addressed the situation of someone having turned > off synchronous_commit in the first place; which is not the case > on wrasse or most/all other buildfarm animals. Whatever we're > dealing with here is something independent of that. That was the intent, but that in itself doesn't mean that it isn't something to do with setting hint bits (not the OldestXmin horizon being held back). I'd really like to know what the removable cutoff looks like for these VACUUM operations, which is something like Andres' VACUUM VERBOSE debug patch should tell us. > > Is there any patch that could plausibly have had that effect, whose > > commit fits with our timeline for the problems seen on wrasse? > > I already enumerated my suspects, back at the top of this thread. Right, but I thought that the syncronous_commit thing was new information that made that worth revisiting. -- Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes: > That was the intent, but that in itself doesn't mean that it isn't > something to do with setting hint bits (not the OldestXmin horizon > being held back). Oh! You mean that maybe the OldestXmin horizon was fine, but something decided not to update hint bits (and therefore also not the all-visible bit) anyway? Worth investigating I guess. > I'd really like to know what the removable cutoff > looks like for these VACUUM operations, which is something like > Andres' VACUUM VERBOSE debug patch should tell us. Yeah. I'd hoped to investigate this manually and not have to clutter the main repo with debugging commits. However, since I've still utterly failed to reproduce the problem on wrasse's host, I think we're going to be forced to do that. regards, tom lane
On Thu, Apr 14, 2022 at 7:20 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Oh! You mean that maybe the OldestXmin horizon was fine, but something > decided not to update hint bits (and therefore also not the all-visible > bit) anyway? Worth investigating I guess. Yes. That is starting to seem like a plausible alternative explanation. > > I'd really like to know what the removable cutoff > > looks like for these VACUUM operations, which is something like > > Andres' VACUUM VERBOSE debug patch should tell us. > > Yeah. I'd hoped to investigate this manually and not have to clutter > the main repo with debugging commits. Suppose that the bug was actually in 06f5295af6, "Add single-item cache when looking at topmost XID of a subtrans XID". Doesn't that fit your timeline just as well? I haven't really started to investigate that theory (just putting dinner on here). Just a wild guess at this point. -- Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes: > Suppose that the bug was actually in 06f5295af6, "Add single-item > cache when looking at topmost XID of a subtrans XID". Doesn't that fit > your timeline just as well? I'd dismissed that on the grounds that there are no subtrans XIDs involved in tenk1's contents. However, if that patch was faulty enough, maybe it affected other cases besides the advertised one? I've not read it. regards, tom lane
On Thu, Apr 14, 2022 at 06:52:49PM -0700, Andres Freund wrote: > On 2022-04-14 21:32:27 -0400, Tom Lane wrote: > > Peter Geoghegan <pg@bowt.ie> writes: > > > Are you aware of Andres' commit 02fea8fd? That work prevented exactly > > > the same set of symptoms (the same index-only scan create_index > > > regressions), > > > > Hm. I'm starting to get the feeling that the real problem here is > > we've "optimized" the system to the point where repeatable results > > from VACUUM are impossible :-( > > The synchronous_commit issue is an old one. It might actually be worth > addressing it by flushing out pending async commits out instead. It just > started to be noticeable when tenk1 load and vacuum were moved closer. > > > What do you think about applying a polished version of what I posted in > https://postgr.es/m/20220414164830.63rk5zqsvtqqk7qz%40alap3.anarazel.de > ? That'd tell us a bit more about the horizon etc. No objection. > It's also interesting that it only happens in the installcheck cases, > afaics, not the check ones. Although that might just be because there's > more of them... I suspect the failure is somehow impossible in "check". Yesterday, I cranked up the number of locales, so there are now a lot more installcheck. Before that, each farm run had one "check" and two "installcheck". Those days saw ten installcheck failures, zero check failures. Like Tom, I'm failing to reproduce this outside the buildfarm client. I wrote a shell script to closely resemble the buildfarm installcheck sequence, but it's lasted a dozen runs without failing.
On Thu, Apr 14, 2022 at 07:45:15PM -0700, Noah Misch wrote: > On Thu, Apr 14, 2022 at 06:52:49PM -0700, Andres Freund wrote: > > On 2022-04-14 21:32:27 -0400, Tom Lane wrote: > > > Peter Geoghegan <pg@bowt.ie> writes: > > > > Are you aware of Andres' commit 02fea8fd? That work prevented exactly > > > > the same set of symptoms (the same index-only scan create_index > > > > regressions), > > > > > > Hm. I'm starting to get the feeling that the real problem here is > > > we've "optimized" the system to the point where repeatable results > > > from VACUUM are impossible :-( > > > > The synchronous_commit issue is an old one. It might actually be worth > > addressing it by flushing out pending async commits out instead. It just > > started to be noticeable when tenk1 load and vacuum were moved closer. > > > > > > What do you think about applying a polished version of what I posted in > > https://postgr.es/m/20220414164830.63rk5zqsvtqqk7qz%40alap3.anarazel.de > > ? That'd tell us a bit more about the horizon etc. > > No objection. > > > It's also interesting that it only happens in the installcheck cases, > > afaics, not the check ones. Although that might just be because there's > > more of them... > > I suspect the failure is somehow impossible in "check". Yesterday, I cranked > up the number of locales, so there are now a lot more installcheck. Before > that, each farm run had one "check" and two "installcheck". Those days saw > ten installcheck failures, zero check failures. > > Like Tom, I'm failing to reproduce this outside the buildfarm client. I wrote > a shell script to closely resemble the buildfarm installcheck sequence, but > it's lasted a dozen runs without failing. But 24s after that email, it did reproduce the problem. Same symptoms as the last buildfarm runs, including visfrac=0. I'm attaching my script. (It has various references to my home directory, so it's not self-contained.)
Noah Misch <noah@leadboat.com> writes: > Like Tom, I'm failing to reproduce this outside the buildfarm client. This is far from the first time that I've failed to reproduce a buildfarm result manually, even on the very machine hosting the animal. I would like to identify the cause(s) of that. One obvious theory is that the environment under a cron job is different --- but the only thing I know of that should be different is possibly nice'ing the job priorities. I did try a fair number of test cycles under "nice" in this case. Anybody have other ideas? regards, tom lane
Noah Misch <noah@leadboat.com> writes: > But 24s after that email, it did reproduce the problem. Ain't that always the way. > Same symptoms as the > last buildfarm runs, including visfrac=0. I'm attaching my script. (It has > various references to my home directory, so it's not self-contained.) That's interesting, because I see you used installcheck-parallel, which I'd not been using on the grounds that wrasse isn't parallelizing these runs. That puts a big hole in my working assumption that the problem is one of timing. I'm now suspecting that the key issue is something about how wrasse is building the executables that I did not exactly reproduce. I did not try to copy the build details involving stuff under your home directory (like the private openldap version), mainly because it didn't seem like openldap or uuid or perl could be involved at all. But maybe somehow? regards, tom lane
On Thu, Apr 14, 2022 at 7:54 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > This is far from the first time that I've failed to reproduce a buildfarm > result manually, even on the very machine hosting the animal. I would > like to identify the cause(s) of that. One obvious theory is that the > environment under a cron job is different --- but the only thing I know > of that should be different is possibly nice'ing the job priorities. > I did try a fair number of test cycles under "nice" in this case. > Anybody have other ideas? Well, Noah is running wrasse with 'fsync = off'. And did so in the script as well. That seems like it definitely could matter. -- Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
> Well, Noah is running wrasse with 'fsync = off'. And did so in the
> script as well.

As am I. I duplicated wrasse's config to the extent of

cat >>$PGDATA/postgresql.conf <<EOF
log_line_prefix = '%m [%p:%l] %q%a '
log_connections = 'true'
log_disconnections = 'true'
log_statement = 'all'
fsync = off
log_autovacuum_min_duration = 0
EOF

One thing I'm eyeing now is that it looks like Noah is re-initdb'ing each time, whereas I'd just stopped and started the postmaster of an existing installation. That does not seem like it could matter but ...

regards, tom lane
On Thu, Apr 14, 2022 at 11:06:04PM -0400, Tom Lane wrote: > Noah Misch <noah@leadboat.com> writes: > > But 24s after that email, it did reproduce the problem. > > Ain't that always the way. Quite so. > > Same symptoms as the > > last buildfarm runs, including visfrac=0. I'm attaching my script. (It has > > various references to my home directory, so it's not self-contained.) > > That's interesting, because I see you used installcheck-parallel, > which I'd not been using on the grounds that wrasse isn't parallelizing > these runs. That puts a big hole in my working assumption that the > problem is one of timing. With "make installcheck-tests TESTS='test_setup create_index'" it remains reproducible. The attached script reproduced it in 103s and then in 703s. > I'm now suspecting that the key issue is > something about how wrasse is building the executables that I did > not exactly reproduce. I did not try to copy the build details > involving stuff under your home directory (like the private openldap > version), mainly because it didn't seem like openldap or uuid or > perl could be involved at all. But maybe somehow? Can't rule it out entirely. I think I've now put world-read on all the directories referenced in the script, in the event you'd like to use them.
I wrote: > One thing I'm eyeing now is that it looks like Noah is re-initdb'ing > each time, whereas I'd just stopped and started the postmaster of > an existing installation. That does not seem like it could matter > but ... Well, damn. I changed my script that way and it failed on the tenth iteration (versus a couple hundred successful iterations the other way). So somehow this is related to time-since-initdb, not time-since-postmaster-start. Any ideas? Anyway, I'm too tired to do more tonight, but now that I can reproduce it I will stick some debugging logic in tomorrow. I no longer think we should clutter the git repo with any more short-term hacks. regards, tom lane PS a bit later: I've not yet reproduced it a second time, so the failure rate is unfortunately a lot less than one-in-ten. Still, this eliminates the idea that there's some secret sauce in Noah's build details.
Hi, On 2022-04-14 22:40:51 -0400, Tom Lane wrote: > Peter Geoghegan <pg@bowt.ie> writes: > > Suppose that the bug was actually in 06f5295af6, "Add single-item > > cache when looking at topmost XID of a subtrans XID". Doesn't that fit > > your timeline just as well? > > I'd dismissed that on the grounds that there are no subtrans XIDs > involved in tenk1's contents. However, if that patch was faulty > enough, maybe it affected other cases besides the advertised one? > I've not read it. I was planning to complain about that commit, fwiw. Without so much as an assertion verifying the cache is correct it seems quite dangerous to me. And looking at it, it has obvious wraparound issues... But that can't matter here, obviously. We also reach SubTransGetTopmostTransaction() from XactLockTableWait() but I don't quite see how we reach that here either... Greetings, Andres Freund
On Thu, Apr 14, 2022 at 11:56:15PM -0400, Tom Lane wrote: > Anyway, I'm too tired to do more tonight, but now that I can reproduce it > I will stick some debugging logic in tomorrow. I no longer think we > should clutter the git repo with any more short-term hacks. Sounds good. I've turned off the wrasse buildfarm client for the moment. It's less useful than your local setup, and they'd compete for resources.
Hi, On 2022-04-14 23:56:15 -0400, Tom Lane wrote: > I wrote: > > One thing I'm eyeing now is that it looks like Noah is re-initdb'ing > > each time, whereas I'd just stopped and started the postmaster of > > an existing installation. That does not seem like it could matter > > but ... > > Well, damn. I changed my script that way and it failed on the tenth > iteration (versus a couple hundred successful iterations the other > way). Just to make sure: This is also on wrasse? What DSM backend do we end up with on solaris? With shared memory stats we're using DSM a lot earlier and more commonly than before. Another thing that might be worth trying is to enable checksums. I've caught weird bugs with that in the past. And it's possible that bgwriter writes out a page that we then read back in quickly after, or something like that. > So somehow this is related to time-since-initdb, not > time-since-postmaster-start. Any ideas? Perhaps it makes a difference that we start with a "young" database xid age wise? We've had bugs around subtracting xids and ending up on some special one in the past. Greetings, Andres Freund
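Both of those are easy to check on the animal itself; nothing wrasse-specific is assumed here:

    -- Which dynamic shared memory implementation did the server pick?
    SHOW dynamic_shared_memory_type;
    -- Are data checksums enabled in this cluster?
    SHOW data_checksums;

Turning checksums on for a test cluster would just mean initdb'ing it with --data-checksums.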
Hi, On 2022-04-14 19:45:15 -0700, Noah Misch wrote: > I suspect the failure is somehow impossible in "check". Yesterday, I cranked > up the number of locales, so there are now a lot more installcheck. Before > that, each farm run had one "check" and two "installcheck". Those days saw > ten installcheck failures, zero check failures. I notice that the buildfarm appears to run initdb with syncing enabled ("syncing data to disk ... ok" in the initdb steps). Whereas pg_regress uses --no-sync. I wonder if that's what makes the difference? Now that you reproduced it, does it still reproduce with --no-sync added? Also worth noting that pg_regress doesn't go through pg_ctl... Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2022-04-14 23:56:15 -0400, Tom Lane wrote: >> Well, damn. I changed my script that way and it failed on the tenth >> iteration (versus a couple hundred successful iterations the other >> way). > Just to make sure: This is also on wrasse? Right, gcc211 with a moderately close approximation to wrasse's build details. Why that shows the problem when we've not seen it elsewhere remains to be seen. > What DSM backend do we end up with on solaris? With shared memory stats > we're using DSM a lot earlier and more commonly than before. That ... is an interesting point. It seems to be just "posix" though. >> So somehow this is related to time-since-initdb, not >> time-since-postmaster-start. Any ideas? > Perhaps it makes a difference that we start with a "young" database xid > age wise? We've had bugs around subtracting xids and ending up on some > special one in the past. It does seem like it's got to be related to small XID and/or small LSN values. No clue right now, but more news tomorrow, I hope. regards, tom lane
On Thu, Apr 14, 2022 at 10:12:05PM -0700, Andres Freund wrote: > On 2022-04-14 19:45:15 -0700, Noah Misch wrote: > > I suspect the failure is somehow impossible in "check". Yesterday, I cranked > > up the number of locales, so there are now a lot more installcheck. Before > > that, each farm run had one "check" and two "installcheck". Those days saw > > ten installcheck failures, zero check failures. > > I notice that the buildfarm appears to run initdb with syncing enabled > ("syncing data to disk ... ok" in the initdb steps). Whereas pg_regress > uses --no-sync. Yep. > I wonder if that's what makes the difference? Now that you reproduced > it, does it still reproduce with --no-sync added? It does; the last version of my script used "initdb -N ...". > Also worth noting that pg_regress doesn't go through pg_ctl... Hmmm.
The morning's first result is that during a failing run, the vacuum in test_setup sees 2022-04-15 16:01:43.064 CEST [4436:75] pg_regress/test_setup LOG: statement: VACUUM ANALYZE tenk1; 2022-04-15 16:01:43.064 CEST [4436:76] pg_regress/test_setup LOG: vacuuming "regression.public.tenk1" 2022-04-15 16:01:43.064 CEST [4436:77] pg_regress/test_setup STATEMENT: VACUUM ANALYZE tenk1; 2022-04-15 16:01:43.071 CEST [4436:78] pg_regress/test_setup LOG: finished vacuuming "regression.public.tenk1": index scans:0 pages: 0 removed, 345 remain, 345 scanned (100.00% of total) tuples: 0 removed, 10000 remain, 0 are dead but not yet removable removable cutoff: 724, older by 26 xids when operation ended index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed avg read rate: 2.189 MB/s, avg write rate: 2.189 MB/s buffer usage: 695 hits, 2 misses, 2 dirtied WAL usage: 1 records, 0 full page images, 188 bytes system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s 2022-04-15 16:01:43.071 CEST [4436:79] pg_regress/test_setup STATEMENT: VACUUM ANALYZE tenk1; OldestXmin = 724 is too old to consider tenk1's contents as all-visible: regression=# select distinct xmin from tenk1; xmin ------ 749 (1 row) In fact, right after initdb pg_controldata shows Latest checkpoint's NextXID: 0:724 Latest checkpoint's oldestXID: 716 So there's no longer any doubt that something is holding back OldestXmin. I will go put some instrumentation into the code that's computing that. regards, tom lane
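A related check can be made from SQL with the pg_visibility contrib module; this is only a sketch, assuming the extension is available in the installation under test (the regression tests themselves don't load it). It reports how many of tenk1's pages actually have their all-visible bit set in the visibility map, independently of what pg_class.relallvisible claims:

    CREATE EXTENSION IF NOT EXISTS pg_visibility;
    SELECT * FROM pg_visibility_map_summary('tenk1');   -- counts of all-visible and all-frozen pages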
Hi, On 2022-04-15 10:15:32 -0400, Tom Lane wrote: > The morning's first result is that during a failing run, > the vacuum in test_setup sees > > 2022-04-15 16:01:43.064 CEST [4436:75] pg_regress/test_setup LOG: statement: VACUUM ANALYZE tenk1; > 2022-04-15 16:01:43.064 CEST [4436:76] pg_regress/test_setup LOG: vacuuming "regression.public.tenk1" > 2022-04-15 16:01:43.064 CEST [4436:77] pg_regress/test_setup STATEMENT: VACUUM ANALYZE tenk1; > 2022-04-15 16:01:43.071 CEST [4436:78] pg_regress/test_setup LOG: finished vacuuming "regression.public.tenk1": indexscans: 0 > pages: 0 removed, 345 remain, 345 scanned (100.00% of total) > tuples: 0 removed, 10000 remain, 0 are dead but not yet removable > removable cutoff: 724, older by 26 xids when operation ended > index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed > avg read rate: 2.189 MB/s, avg write rate: 2.189 MB/s > buffer usage: 695 hits, 2 misses, 2 dirtied > WAL usage: 1 records, 0 full page images, 188 bytes > system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s > 2022-04-15 16:01:43.071 CEST [4436:79] pg_regress/test_setup STATEMENT: VACUUM ANALYZE tenk1; The horizon advancing by 26 xids during tenk1's vacuum seems like quite a bit, given there's no normal concurrent activity during test_setup. > In fact, right after initdb pg_controldata shows > Latest checkpoint's NextXID: 0:724 > Latest checkpoint's oldestXID: 716 So that's the xmin that e.g. the autovac launcher ends up with during start... If I make get_database_list() sleep for 5s within the scan, I can reproduce on x86-64. Greetings, Andres Freund
I wrote: > the vacuum in test_setup sees > ... > removable cutoff: 724, older by 26 xids when operation ended > ... BTW, before I forget: the wording of this log message is just awful. On first sight, I thought that it meant that we'd computed OldestXmin a second time and discovered that it advanced by 26 xids while the VACUUM was running. Looking at the code, I see that's not so: diff = (int32) (ReadNextTransactionId() - OldestXmin); appendStringInfo(&buf, _("removable cutoff: %u, older by %d xids when operation ended\n"), OldestXmin, diff); but good luck understanding what it actually means from the message text alone. I think more appropriate wording would be something like "removable cutoff: %u, which was %d xids old when operation ended\n" Also, is it really our practice to spell XID in lower-case in user-facing messages? Thoughts, better ideas? regards, tom lane
Hi, On April 15, 2022 11:12:10 AM EDT, Andres Freund <andres@anarazel.de> wrote: >Hi, > >On 2022-04-15 10:15:32 -0400, Tom Lane wrote: >> The morning's first result is that during a failing run, >> the vacuum in test_setup sees >> >> 2022-04-15 16:01:43.064 CEST [4436:75] pg_regress/test_setup LOG: statement: VACUUM ANALYZE tenk1; >> 2022-04-15 16:01:43.064 CEST [4436:76] pg_regress/test_setup LOG: vacuuming "regression.public.tenk1" >> 2022-04-15 16:01:43.064 CEST [4436:77] pg_regress/test_setup STATEMENT: VACUUM ANALYZE tenk1; >> 2022-04-15 16:01:43.071 CEST [4436:78] pg_regress/test_setup LOG: finished vacuuming "regression.public.tenk1": indexscans: 0 >> pages: 0 removed, 345 remain, 345 scanned (100.00% of total) >> tuples: 0 removed, 10000 remain, 0 are dead but not yet removable >> removable cutoff: 724, older by 26 xids when operation ended >> index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed >> avg read rate: 2.189 MB/s, avg write rate: 2.189 MB/s >> buffer usage: 695 hits, 2 misses, 2 dirtied >> WAL usage: 1 records, 0 full page images, 188 bytes >> system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s >> 2022-04-15 16:01:43.071 CEST [4436:79] pg_regress/test_setup STATEMENT: VACUUM ANALYZE tenk1; > >The horizon advancing by 26 xids during tenk1's vacuum seems like quite >a bit, given there's no normal concurrent activity during test_setup. > > >> In fact, right after initdb pg_controldata shows >> Latest checkpoint's NextXID: 0:724 >> Latest checkpoint's oldestXID: 716 > >So that's the xmin that e.g. the autovac launcher ends up with during >start... > >If I make get_database_list() sleep for 5s within the scan, I can >reproduce on x86-64. Off for a bit, but I realized that we likely don't exclude the launcher because it's not database associated... -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
I wrote: > So there's no longer any doubt that something is holding back OldestXmin. > I will go put some instrumentation into the code that's computing that. The something is the logical replication launcher. In the failing runs, it is advertising xmin = 724 (the post-initdb NextXID) and continues to do so well past the point where tenk1 gets vacuumed. Discuss. regards, tom lane
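One way to watch for this sort of thing while it happens is to look at the xmins advertised in pg_stat_activity; a sketch, not specific to wrasse. While the launcher holds an xmin it should show up here (its backend_type is 'logical replication launcher'):

    SELECT pid, backend_type, backend_xmin, state
      FROM pg_stat_activity
     WHERE backend_xmin IS NOT NULL
     ORDER BY age(backend_xmin) DESC;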
Andres Freund <andres@anarazel.de> writes: > Off for a bit, but I realized that we likely don't exclude the launcher because it's not database associated... Yeah. I think this bit in ComputeXidHorizons needs rethinking: /* * Normally queries in other databases are ignored for anything but * the shared horizon. ... */ if (in_recovery || MyDatabaseId == InvalidOid || proc->databaseId == MyDatabaseId || proc->databaseId == 0) /* always include WalSender */ { The "proc->databaseId == 0" business apparently means to include only walsender processes, and it's broken because that condition doesn't include only walsender processes. At this point we have the following conclusions: 1. A slow transaction in the launcher's initial get_database_list() call fully explains these failures. (I had been thinking that the launcher's xact would have to persist as far as the create_index script, but that's not so: it only has to last until test_setup begins vacuuming tenk1. The CREATE INDEX steps are not doing any visibility map changes of their own, but what they are doing is updating relallvisible from the results of visibilitymap_count(). That's why they undid the effects of manually poking relallvisible, without actually inserting any data better than what the initial VACUUM computed.) 2. We can probably explain why only wrasse sees this as some quirk of the Solaris scheduler. I'm satisfied to blame it-happens-in- installcheck-but-not-check on that too. 3. It remains unclear why we suddenly started seeing this last week. I suppose it has to be a side-effect of the pgstats changes, but the mechanism is obscure. Probably not worth the effort to pin down exactly why. As for fixing it, what I think would be the preferable answer is to fix the above-quoted logic so that it indeed includes only walsenders and not random other background workers. (Why does it need to include walsenders, anyway? The commentary sucks.) Alternatively, or perhaps also, we could do what was discussed previously and make a hack to allow delaying vacuum until the system is quiescent. regards, tom lane
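To make the second alternative concrete, here is a rough sketch of what a wait-until-quiescent step ahead of the vacuum could look like; purely illustrative, not a worked-out patch, and it assumes nothing beyond pg_stat_activity being visible to the test role:

    DO $$
    BEGIN
      LOOP
        PERFORM 1 FROM pg_stat_activity
          WHERE backend_xmin IS NOT NULL AND pid <> pg_backend_pid();
        EXIT WHEN NOT FOUND;        -- no other backend is advertising an xmin
        PERFORM pg_sleep(0.1);      -- brief pause before rechecking
      END LOOP;
    END $$;
    VACUUM ANALYZE tenk1;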
Andres Freund <andres@anarazel.de> writes: > On 2022-04-15 10:15:32 -0400, Tom Lane wrote: >> removable cutoff: 724, older by 26 xids when operation ended > The horizon advancing by 26 xids during tenk1's vacuum seems like quite > a bit, given there's no normal concurrent activity during test_setup. Hah, so you were taken in by this wording too. See my complaint about it downthread. regards, tom lane
Hi, On April 15, 2022 11:23:40 AM EDT, Tom Lane <tgl@sss.pgh.pa.us> wrote: >I wrote: >> So there's no longer any doubt that something is holding back OldestXmin. >> I will go put some instrumentation into the code that's computing that. > >The something is the logical replication launcher. In the failing runs, >it is advertising xmin = 724 (the post-initdb NextXID) and continues to >do so well past the point where tenk1 gets vacuumed. > >Discuss. That explains it. Before shmstats, autovac needed to wait for the stats collector to write out stats. Now it's near instantaneous. So the issue probably existed before, just unlikely to ever be reached. We can't just ignore database-less xmins for non-shared rels, because walsender propagates hot_standby_feedback that way. But we can probably add a flag somewhere indicating whether a database-less PGPROC has to be accounted in the horizon for non-shared rels. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Fri, Apr 15, 2022 at 8:14 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > BTW, before I forget: the wording of this log message is just awful. > On first sight, I thought that it meant that we'd computed OldestXmin > a second time and discovered that it advanced by 26 xids while the VACUUM > was running. > "removable cutoff: %u, which was %d xids old when operation ended\n" How the output appears when placed right before the output describing how VACUUM advanced relfrozenxid is an important consideration. I want the format and wording that we use to imply a relationship between these two things. Right now, that other line looks like this: "new relfrozenxid: %u, which is %d xids ahead of previous value\n" Do you think that this juxtaposition works well? > Also, is it really our practice to spell XID in lower-case in > user-facing messages? There are examples of both. This could easily be changed to "XIDs". -- Peter Geoghegan
Andres Freund <andres@anarazel.de> writes: > On April 15, 2022 11:23:40 AM EDT, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> The something is the logical replication launcher. In the failing runs, >> it is advertising xmin = 724 (the post-initdb NextXID) and continues to >> do so well past the point where tenk1 gets vacuumed. > That explains it. Before shmstats, autovac needed to wait for the stats collector to write out stats. Now it's near instantaneous. So the issue probably existed before, just unlikely to ever be reached. Um, this is the logical replication launcher, not the autovac launcher. Your observation that a sleep in get_database_list() reproduces it confirms that, and I don't entirely see why the timing of the LR launcher would have changed. (On thinking about it, I suppose the AV launcher might trigger this too, but that is not the PID I saw in testing.) > We can't just ignore database-less xmins for non-shared rels, because walsender propagates hot_standby_feedback that way. But we can probably add a flag somewhere indicating whether a database-less PGPROC has to be accounted in the horizon for non-shared rels. Yeah, I was also thinking about a flag in PGPROC being a more reliable way to do this. Is there anything besides walsenders that should set that flag? regards, tom lane
Peter Geoghegan <pg@bowt.ie> writes: > On Fri, Apr 15, 2022 at 8:14 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> BTW, before I forget: the wording of this log message is just awful. >> [ so how about ] >> "removable cutoff: %u, which was %d xids old when operation ended\n" > How the output appears when placed right before the output describing > how VACUUM advanced relfrozenxid is an important consideration. I want > the format and wording that we use to imply a relationship between > these two things. Right now, that other line looks like this: > "new relfrozenxid: %u, which is %d xids ahead of previous value\n" > Do you think that this juxtaposition works well? Seems all right to me; do you have a better suggestion? regards, tom lane
On Fri, Apr 15, 2022 at 9:40 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Do you think that this juxtaposition works well? > > Seems all right to me; do you have a better suggestion? No. At first I thought that mixing "which is" and "which was" wasn't quite right. I changed my mind, though. Your new wording is fine. I'll update the log output some time today. -- Peter Geoghegan
Hi, On 2022-04-15 12:36:52 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On April 15, 2022 11:23:40 AM EDT, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> The something is the logical replication launcher. In the failing runs, > >> it is advertising xmin = 724 (the post-initdb NextXID) and continues to > >> do so well past the point where tenk1 gets vacuumed. > > > That explains it. Before shmstats, autovac needed to wait for the stats collector to write out stats. Now it's near instantaneous. So the issue probably existed before, just unlikely to ever be reached. > > Um, this is the logical replication launcher, not the autovac > launcher. Short term confusion... > Your observation that a sleep in get_database_list() reproduces it > confirms that I don't understand what you mean here? get_database_list() is autovac launcher code? So being able to reproduce the problem by putting in a sleep there doesn't seem to confirm anything about the logical rep launcher? > , and I don't entirely see why the timing of the LR launcher > would have changed. Could still be related to the autovac launcher not requesting / pgstats not writing / launcher not reading the stats file(s). That obviously is going to have some scheduler impact. > > We can't just ignore database-less xmins for non-shared rels, because walsender propagates hot_standby_feedback that way. But we can probably add a flag somewhere indicating whether a database-less PGPROC has to be accounted in the horizon for non-shared rels. > > Yeah, I was also thinking about a flag in PGPROC being a more reliable > way to do this. Is there anything besides walsenders that should set > that flag? Not that I can think of. It's only because of hs_feedback that we need to. I guess it's possible that somebody built some extension that needs something similar, but then they'd need to set that flag... Greetings, Andres Freund
I wrote: > Um, this is the logical replication launcher, not the autovac launcher. > Your observation that a sleep in get_database_list() reproduces it > confirms that, and I don't entirely see why the timing of the LR launcher > would have changed. Oh, to clarify: I misread "get_database_list()" as "get_subscription_list()", which is the part of the LR launcher startup that causes the problem for me. So what we actually have confirmed is that BOTH of those launchers are problematic for this. And AFAICS neither of them needs to be causing horizon adjustments for non-shared tables. (It's possible that the AV launcher is responsible in some of wrasse's reports, but it's been the LR launcher in five out of five sufficiently-instrumented failures for me.) regards, tom lane
Hi, On 2022-04-15 09:29:20 -0700, Peter Geoghegan wrote: > On Fri, Apr 15, 2022 at 8:14 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > BTW, before I forget: the wording of this log message is just awful. > > On first sight, I thought that it meant that we'd computed OldestXmin > > a second time and discovered that it advanced by 26 xids while the VACUUM > > was running. > > > "removable cutoff: %u, which was %d xids old when operation ended\n" > > How the output appears when placed right before the output describing > how VACUUM advanced relfrozenxid is an important consideration. I want > the format and wording that we use to imply a relationship between > these two things. Right now, that other line looks like this: > > "new relfrozenxid: %u, which is %d xids ahead of previous value\n" > > Do you think that this juxtaposition works well? I don't think they're actually that comparable. One shows how much relfrozenxid advanced, to a large degree influenced by the time between aggressive (or "unintentionally aggressive") vacuums. The other shows the age of OldestXmin at the end of the vacuum. Which is influenced by what's currently running. Greetings, Andres Freund
On Fri, Apr 15, 2022 at 10:05 AM Andres Freund <andres@anarazel.de> wrote: > I don't think they're actually that comparable. One shows how much > relfrozenxid advanced, to a large degree influenced by the time between > aggressive (or "unintentionally aggressive") vacuums. It matters more in the extreme cases. The most recent possible value for our new relfrozenxid is OldestXmin/removable cutoff. So when something holds back OldestXmin, it also holds back new relfrozenxid values. > The other shows > the age of OldestXmin at the end of the vacuum. Which is influenced by > what's currently running. As well as the age of OldestXmin at the start of VACUUM. -- Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes: > On Fri, Apr 15, 2022 at 10:05 AM Andres Freund <andres@anarazel.de> wrote: >> The other shows >> the age of OldestXmin at the end of the vacuum. Which is influenced by >> what's currently running. > As well as the age of OldestXmin at the start of VACUUM. Is it worth capturing and logging both of those numbers? Why is the age at the end more interesting than the age at the start? regards, tom lane
On Fri, Apr 15, 2022 at 10:15 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > As well as the age of OldestXmin at the start of VACUUM. > > Is it worth capturing and logging both of those numbers? Why is > the age at the end more interesting than the age at the start? As Andres said, that's often more interesting because most of the time OldestXmin is not held back by much (not enough to matter). Users will often look at the output of successive related VACUUM operations. Often the way things change over time is much more interesting than the details at any particular point in time. Especially in the kinds of extreme cases I'm thinking about. -- Peter Geoghegan
Hi, On 2022-04-15 10:23:56 -0700, Peter Geoghegan wrote: > On Fri, Apr 15, 2022 at 10:15 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > As well as the age of OldestXmin at the start of VACUUM. > > > > Is it worth capturing and logging both of those numbers? Why is > > the age at the end more interesting than the age at the start? > > As Andres said, that's often more interesting because most of the time > OldestXmin is not held back by much (not enough to matter). I think it'd be interesting - particularly for large relations or when looking to adjust autovac cost limits. It's not rare for autovac to take long enough that another autovac is necessary immediately again. Also helps to interpret the "dead but not yet removable" counts. Something like: removable cutoff: %u, age at start: %u, age at end: %u... Greetings, Andres Freund
On Fri, Apr 15, 2022 at 10:43 AM Andres Freund <andres@anarazel.de> wrote: > I think it'd be interesting - particularly for large relations or when > looking to adjust autovac cost limits. > Something like: > removable cutoff: %u, age at start: %u, age at end: %u... Part of the problem here is that we determine VACUUM's FreezeLimit by calculating `OldestXmin - vacuum_freeze_min_age` (more or less [1]). Why should we do less freezing due to the presence of an old snapshot? Sure, that has to happen with those XIDs that are fundamentally ineligible for freezing due to the presence of the old snapshot -- but what about those XIDs that *are* eligible, and still don't get frozen at first? We should determine FreezeLimit by calculating `NextXID - vacuum_freeze_min_age ` instead (and then clamp, to make sure that it's always <= OldestXmin). That approach would make our final FreezeLimit "strictly age-based". [1] We do something a bit like this when OldestXmin is already very old -- then FreezeLimit is the same value as OldestXmin (see WARNING from vacuum_set_xid_limits() function). That's better than nothing, but doesn't change the fact that our general approach to calculating FreezeLimit makes little sense. -- Peter Geoghegan
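A worked example with made-up numbers (nothing here comes from the wrasse runs): suppose NextXID is 1000724, OldestXmin is being held back at 1000000, and vacuum_freeze_min_age is 500000.

    -- current formula: the cutoff tracks the held-back horizon
    SELECT 1000000 - 500000 AS freezelimit_current;                    -- 500000
    -- proposed: age the cutoff from NextXID, then clamp to OldestXmin
    SELECT LEAST(1000724 - 500000, 1000000) AS freezelimit_proposed;   -- 500724

With the proposed calculation, XIDs in the 500000-500724 range get frozen as well; all of them are already older than OldestXmin, so the old snapshot never required them to stay unfrozen in the first place.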
Andres Freund <andres@anarazel.de> writes: > On 2022-04-15 12:36:52 -0400, Tom Lane wrote: >> Yeah, I was also thinking about a flag in PGPROC being a more reliable >> way to do this. Is there anything besides walsenders that should set >> that flag? > Not that I can think of. It's only because of hs_feedback that we need > to. I guess it's possible that somebody build some extension that needs > something similar, but then they'd need to set that flag... Here's a WIP patch for that. The only exciting thing in it is that because of some undocumented cowboy programming in walsender.c, the Assert((proc->statusFlags & (~PROC_COPYABLE_FLAGS)) == 0); in ProcArrayInstallRestoredXmin fires unless we skip that. I could use some help filling in the XXX comments, because it's far from clear to me *why* walsenders need this to happen. regards, tom lane diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c index be40261393..a04a4a8e5d 100644 --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -285,6 +285,15 @@ InitWalSender(void) MarkPostmasterChildWalSender(); SendPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE); + /* + * Also, mark ourselves in PGPROC as affecting vacuum horizons in all + * databases. XXX explain why + */ + LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE); + MyProc->statusFlags |= PROC_AFFECTS_ALL_HORIZONS; + ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags; + LWLockRelease(ProcArrayLock); + /* Initialize empty timestamp buffer for lag tracking. */ lag_tracker = MemoryContextAllocZero(TopMemoryContext, sizeof(LagTracker)); } diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c index f6e98aae29..8b867bd56c 100644 --- a/src/backend/storage/ipc/procarray.c +++ b/src/backend/storage/ipc/procarray.c @@ -1821,10 +1821,12 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h) * to prune still needed data away). If the current backend never * connects to a database that is harmless, because * data_oldest_nonremovable will never be utilized. + * + * XXX explain PROC_AFFECTS_ALL_HORIZONS, too */ if (in_recovery || MyDatabaseId == InvalidOid || proc->databaseId == MyDatabaseId || - proc->databaseId == 0) /* always include WalSender */ + (statusFlags & PROC_AFFECTS_ALL_HORIZONS)) { /* * We can ignore this backend if it's running CREATE INDEX @@ -2681,10 +2683,14 @@ ProcArrayInstallRestoredXmin(TransactionId xmin, PGPROC *proc) /* Install xmin */ MyProc->xmin = TransactionXmin = xmin; - /* Flags being copied must be valid copy-able flags. */ - Assert((proc->statusFlags & (~PROC_COPYABLE_FLAGS)) == 0); - MyProc->statusFlags = proc->statusFlags; - ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags; + /* walsender cheats by passing proc == MyProc, don't check flags */ + if (proc != MyProc) + { + /* Flags being copied must be valid copy-able flags. 
*/ + Assert((proc->statusFlags & (~PROC_COPYABLE_FLAGS)) == 0); + MyProc->statusFlags = proc->statusFlags; + ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags; + } result = true; } diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h index 809b74db5e..e132665938 100644 --- a/src/include/storage/proc.h +++ b/src/include/storage/proc.h @@ -60,6 +60,9 @@ struct XidCache #define PROC_VACUUM_FOR_WRAPAROUND 0x08 /* set by autovac only */ #define PROC_IN_LOGICAL_DECODING 0x10 /* currently doing logical * decoding outside xact */ +#define PROC_AFFECTS_ALL_HORIZONS 0x20 /* proc's xmin must be + * included in vacuum horizons + * in all databases */ /* flags reset at EOXact */ #define PROC_VACUUM_STATE_MASK \
Hi, On April 15, 2022 2:14:47 PM EDT, Tom Lane <tgl@sss.pgh.pa.us> wrote: >Andres Freund <andres@anarazel.de> writes: >> On 2022-04-15 12:36:52 -0400, Tom Lane wrote: >>> Yeah, I was also thinking about a flag in PGPROC being a more reliable >>> way to do this. Is there anything besides walsenders that should set >>> that flag? > >> Not that I can think of. It's only because of hs_feedback that we need >> to. I guess it's possible that somebody built some extension that needs >> something similar, but then they'd need to set that flag... > >Here's a WIP patch for that. The only exciting thing in it is that >because of some undocumented cowboy programming in walsender.c, the > Assert((proc->statusFlags & (~PROC_COPYABLE_FLAGS)) == 0); >in ProcArrayInstallRestoredXmin fires unless we skip that. > >I could use some help filling in the XXX comments, because it's far >from clear to me *why* walsenders need this to happen. I'm out for the rest of the day due to family events (visiting my girlfriend's parents till Wednesday), I can take a stab at formulating something after. If you want to commit before: The reason is that walsenders use their xmin to represent the xmin of standbys when using hot_standby_feedback. Since we're only transmitting global horizons up from standbys, it has to influence globally (and it would be hard to represent per-db horizons anyway). Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Andres Freund <andres@anarazel.de> writes: > On April 15, 2022 2:14:47 PM EDT, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I could use some help filling in the XXX comments, because it's far >> from clear to me *why* walsenders need this to happen. > If you want to commit before: The reason is that walsenders use their xmin to represent the xmin of standbys when usinghot_standby_feedback. Since we're only transmitting global horizons up from standbys, it has to influence globally (andit would be hard to represent per db horizons anyway). Got it. I rewrote the comments and pushed. Noah, it should be safe to turn wrasse back on. regards, tom lane
Hi, On 2022-04-15 11:12:34 -0700, Peter Geoghegan wrote: > On Fri, Apr 15, 2022 at 10:43 AM Andres Freund <andres@anarazel.de> wrote: > > I think it'd be interesting - particularly for large relations or when > > looking to adjust autovac cost limits. > > > Something like: > > removable cutoff: %u, age at start: %u, age at end: %u... > > Part of the problem here is that we determine VACUUM's FreezeLimit by > calculating `OldestXmin - vacuum_freeze_min_age` (more or less [1]). What the message outputs is OldestXmin and not FreezeLimit though. And FreezeLimit doesn't affect "dead but not yet removable". IOW, the following might be right, but that seems independent of improving the output of diff = (int32) (ReadNextTransactionId() - OldestXmin); appendStringInfo(&buf, _("removable cutoff: %u, which was %d XIDs old when operation ended\n"), OldestXmin, diff); > Why should we do less freezing due to the presence of an old snapshot? > Sure, that has to happen with those XIDs that are fundamentally > ineligible for freezing due to the presence of the old snapshot -- but > what about those XIDs that *are* eligible, and still don't get frozen > at first? > > We should determine FreezeLimit by calculating `NextXID - > vacuum_freeze_min_age ` instead (and then clamp, to make sure that > it's always <= OldestXmin). That approach would make our final > FreezeLimit "strictly age-based". > > [1] We do something a bit like this when OldestXmin is already very > old -- then FreezeLimit is the same value as OldestXmin (see WARNING > from vacuum_set_xid_limits() function). That's better than nothing, > but doesn't change the fact that our general approach to calculating > FreezeLimit makes little sense. Greetings, Andres Freund
On Sun, Apr 17, 2022 at 7:36 AM Andres Freund <andres@anarazel.de> wrote: > > Part of the problem here is that we determine VACUUM's FreezeLimit by > > calculating `OldestXmin - vacuum_freeze_min_age` (more or less [1]). > > What the message outputs is OldestXmin and not FreezeLimit though. My higher level point is that there is a general tendency to assume that OldestXmin is the same thing as NextXID, which it isn't. It's an easy enough mistake to make, though, in part because they're usually quite close together. The "Routine Vacuuming" docs seem to suggest that they're the same thing, or at least that's what I take away from the following sentence: "This implies that if a table is not otherwise vacuumed, autovacuum will be invoked on it approximately once every autovacuum_freeze_max_age minus vacuum_freeze_min_age transactions". > And FreezeLimit doesn't affect "dead but not yet removable". But OldestXmin affects FreezeLimit. Anyway, I'm not opposed to showing the age at the start as well. But from the point of view of issues like this tenk1 issue, it would be more useful to just report on new_rel_allvisible. It would also be more useful to users. -- Peter Geoghegan
On 2022-Apr-15, Tom Lane wrote: > Here's a WIP patch for that. The only exciting thing in it is that > because of some undocumented cowboy programming in walsender.c, the > Assert((proc->statusFlags & (~PROC_COPYABLE_FLAGS)) == 0); > in ProcArrayInstallRestoredXmin fires unless we skip that. Hmm, maybe a better use of that define is to use it to select which flags to copy, rather than to ensure they are the only ones set. What about this? -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "What do the years matter? What really matters is finding out that, when all is said and done, the best age in life is being alive" (Mafalda)
Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > On 2022-Apr-15, Tom Lane wrote: >> Here's a WIP patch for that. The only exciting thing in it is that >> because of some undocumented cowboy programming in walsender.c, the >> Assert((proc->statusFlags & (~PROC_COPYABLE_FLAGS)) == 0); >> in ProcArrayInstallRestoredXmin fires unless we skip that. > Hmm, maybe a better use of that define is to use it to select which flags > to copy, rather than to ensure they are the only ones set. What > about this? Yeah, I thought about that too, but figured that the author probably had a reason for writing the assertion the way it was. If we want to redefine PROC_COPYABLE_FLAGS as "flags associated with xmin", that's fine by me. But I'd suggest that both the name of the macro and the comment for it in proc.h should be revised to match that definition. Another point is that as you've coded it, the code doesn't so much copy those flags as union them with whatever the recipient had, which seems wrong. I could go with either Assert(!(MyProc->statusFlags & PROC_XMIN_FLAGS)); MyProc->statusFlags |= (proc->statusFlags & PROC_XMIN_FLAGS); or MyProc->statusFlags = (MyProc->statusFlags & ~PROC_XMIN_FLAGS) | (proc->statusFlags & PROC_XMIN_FLAGS); Perhaps the latter is more future-proof. regards, tom lane
On Wed, Apr 20, 2022 at 3:29 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > > On 2022-Apr-15, Tom Lane wrote: > >> Here's a WIP patch for that. The only exciting thing in it is that > >> because of some undocumented cowboy programming in walsender.c, the > >> Assert((proc->statusFlags & (~PROC_COPYABLE_FLAGS)) == 0); > >> in ProcArrayInstallRestoredXmin fires unless we skip that. > > > Hmm, maybe a better use of that define is to use it to select which flags > > to copy, rather than to ensure they are the only ones set. What > > about this? > > Yeah, I thought about that too, but figured that the author probably > had a reason for writing the assertion the way it was. The motivation behind the assertion was that when we copy the whole statusFlags from the leader process to the worker process, we want to make sure that the flags we're copying are a known subset of the flags that are valid to copy from the leader. > If we want > to redefine PROC_COPYABLE_FLAGS as "flags associated with xmin", > that's fine by me. But I'd suggest that both the name of the macro > and the comment for it in proc.h should be revised to match that > definition. > > Another point is that as you've coded it, the code doesn't so much > copy those flags as union them with whatever the recipient had, > which seems wrong. I could go with either > > Assert(!(MyProc->statusFlags & PROC_XMIN_FLAGS)); > MyProc->statusFlags |= (proc->statusFlags & PROC_XMIN_FLAGS); > > or > > MyProc->statusFlags = (MyProc->statusFlags & ~PROC_XMIN_FLAGS) | > (proc->statusFlags & PROC_XMIN_FLAGS); > > Perhaps the latter is more future-proof. Copying only xmin-related flags in this way also makes sense to me and there is no problem at least for now. A note would be that when we introduce a new flag that needs to be copied in the future, we need to make sure to add it to PROC_XMIN_FLAGS so it is copied. Otherwise a similar issue we fixed by 0f0cfb494004befb0f6e could happen again. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On 2022-Apr-20, Masahiko Sawada wrote: > > MyProc->statusFlags = (MyProc->statusFlags & ~PROC_XMIN_FLAGS) | > > (proc->statusFlags & PROC_XMIN_FLAGS); > > > > Perhaps the latter is more future-proof. > Copying only xmin-related flags in this way also makes sense to me and > there is no problem at least for now. A note would be that when we > introduce a new flag that needs to be copied in the future, we need to > make sure to add it to PROC_XMIN_FLAGS so it is copied. Otherwise a > similar issue we fixed by 0f0cfb494004befb0f6e could happen again. OK, done this way -- patch attached. Reading the comment I wrote about it, I wonder if flags PROC_AFFECTS_ALL_HORIZONS and PROC_IN_LOGICAL_DECODING should also be included. I think the only reason we don't care at this point is that walsenders (logical or otherwise) do not engage in snapshot copying. But if we were to implement usage of parallel workers sharing a common snapshot to do table sync in parallel, then it ISTM it would be important to copy at least the latter. Not sure there are any cases where we might care about the former. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Every machine is a smoke machine if you operate it wrong enough." https://twitter.com/libseybieda/status/1541673325781196801
On Sun, May 15, 2022 at 12:29 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2022-Apr-20, Masahiko Sawada wrote: > > > > MyProc->statusFlags = (MyProc->statusFlags & ~PROC_XMIN_FLAGS) | > > > (proc->statusFlags & PROC_XMIN_FLAGS); > > > > > > Perhaps the latter is more future-proof. > > > Copying only xmin-related flags in this way also makes sense to me and > > there is no problem at least for now. A note would be that when we > > introduce a new flag that needs to be copied in the future, we need to > > make sure to add it to PROC_XMIN_FLAGS so it is copied. Otherwise a > > similar issue we fixed by 0f0cfb494004befb0f6e could happen again. > > OK, done this way -- patch attached. Thank you for updating the patch. > > Reading the comment I wrote about it, I wonder if flags > PROC_AFFECTS_ALL_HORIZONS and PROC_IN_LOGICAL_DECODING should also be > included. I think the only reason we don't care at this point is that > walsenders (logical or otherwise) do not engage in snapshot copying. > But if we were to implement usage of parallel workers sharing a common > snapshot to do table sync in parallel, then it ISTM it would be > important to copy at least the latter. Not sure there are any cases > were we might care about the former. Yeah, it seems to be inconsistent between the comment (and the new name) and the flags actually included. I think we can include all xmin-related flags to PROC_XMIN_FLAGS as the comment says. That way, it would be useful also for other use cases, and I don't see any downside for now. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
I was looking at backpatching this to pg13. That made me realize that commit dc7420c2c927 changed things in 14; and before that commit, the bitmask that is checked is PROCARRAY_FLAGS_VACUUM, which is defined independently of whatever proc.h says. As far as I can tell, there's no problem with the patches I'm posting here (the backpatched versions for pg13 and pg14). But it's something to be aware of; and if we do want to add the additional bits to the bitmask, we should do that in a separate master-only commit. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/