Thread: RE: PG12 autovac issues
Hi Andres-

Thanks for the reply, answers below.

On Tue, Mar 17, 2020 at 8:19 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-03-17 17:18:57 -0500, Justin King wrote:
> > As you can see in this table, there are only ~80K rows, but billions
> > of updates. What we have observed is that the frozenxid reaches the
> > 200M mark fairly quickly because of the amount of activity.
>
> And each of those updates is in a separate transaction? Is that
> required? I.e. any chance to perform multiple of those updates in one
> transaction?
>
> Have you considered just increasing the vacuum limit? It's mostly there
> because it can increase space usage a bit, but given today's systems it's
> not usually a problem unless you have hundreds of postgres instances
> on a single system. It's pretty common to just increase that to 1.8
> billion or so.

We have considered increasing the limit, but as I mentioned, the problem
that we're seeing is that (effectively) an autovac starts on a system
database (postgres, template1) and never completes, or deadlocks, or
something. This completely stops autovacs from running until we manually
intervene and run a VACUUM FREEZE -- at which point, the autovacs resume.
If we increase the vacuum limit and this situation happens, we're going
to be in real trouble.

> From a single stats snapshot we can't actually understand the actual xid
> consumption - is it actually the xid usage that triggers the vacuums?

We have looked at this and the xid consumption averages around 1250
xid/sec -- this is when we see the "aggressive" autovac kick off in the
logs. What I don't understand is why these xid's are being consumed at
this rate on the databases with no activity (postgres, template1).

> What makes you think it is a problem that you have all these vacuums? If
> you actually update that much, and you have indexes, you're going to want
> a lot of vacuums?

I actually don't think there's a problem with the vacuums (I was mostly
pointing out that they are very regular and not problematic). The main
problem I am having is that something is causing the autovacuums to
completely stop and require manual intervention to resume -- and it seems
to be when the "postgres" or "template1" database hits the
autovacuum_freeze_max_age.

> > What is interesting is that this happens with the 'postgres' and
> > 'template1' databases as well and there is absolutely no activity in
> > those databases.
>
> That's normal. They should be pretty darn quick in v12?

Yes, a manual VACUUM FREEZE of either database takes less than 1 second --
which is why it's perplexing that the autovac starts and never seems to
complete and prevents other autovacs from running.

> Greetings,
>
> Andres Freund
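(A quick way to estimate the xid consumption rate mentioned above -- a
sketch, not a query from this thread; note that txid_current() itself
consumes one xid per call:)

    SELECT txid_current();   -- note the value
    SELECT pg_sleep(60);
    SELECT txid_current();   -- (second value - first value) / 60 ~= xid/sec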
Hi,

On 2020-03-19 10:23:48 -0500, Justin King wrote:
> > From a single stats snapshot we can't actually understand the actual xid
> > consumption - is it actually the xid usage that triggers the vacuums?
>
> We have looked at this and the xid consumption averages around 1250
> xid/sec -- this is when we see the "aggressive" autovac kick off in
> the logs. What I don't understand is why these xid's are being
> consumed at this rate on the databases with no activity (postgres,
> template1).

The xid counter is global across all databases.

> > What makes you think it is a problem that you have all these vacuums? If
> > you actually update that much, and you have indexes, you're going to want
> > a lot of vacuums?
>
> I actually don't think there's a problem with the vacuums (I was
> mostly pointing out that they are very regular and not problematic).
> The main problem I am having is that something is causing the
> autovacuums to completely stop and require manual intervention to
> resume -- and it seems to be when the "postgres" or "template1"
> database hits the autovacuum_freeze_max_age.

Did you look at pg_stat_activity for those autovacuums to see whether
they're blocked on something?

> > > What is interesting is that this happens with the 'postgres' and
> > > 'template1' databases as well and there is absolutely no activity in
> > > those databases.
> >
> > That's normal. They should be pretty darn quick in v12?
>
> Yes, a manual VACUUM FREEZE of either database takes less than 1
> second -- which is why it's perplexing that the autovac starts and
> never seems to complete and prevents other autovacs from running.

One big difference between a manual VACUUM and autovacuum is that with
the default settings VACUUM is not throttled, but autovacuum is.

What are your vacuum_cost_delay, autovacuum_vacuum_cost_delay,
vacuum_cost_limit, autovacuum_vacuum_cost_limit, vacuum_cost_page_hit,
vacuum_cost_page_miss set to?

Greetings,

Andres Freund
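(One way the pg_stat_activity check asked about above might look, assuming
PostgreSQL 12 -- a sketch, not a query from the thread:)

    SELECT pid, datname, wait_event_type, wait_event,
           pg_blocking_pids(pid) AS blocked_by,
           now() - xact_start AS xact_runtime, query
    FROM pg_stat_activity
    WHERE backend_type = 'autovacuum worker'
    ORDER BY xact_start;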
On Thu, Mar 19, 2020 at 5:35 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-03-19 10:23:48 -0500, Justin King wrote:
> > > From a single stats snapshot we can't actually understand the actual xid
> > > consumption - is it actually the xid usage that triggers the vacuums?
> >
> > We have looked at this and the xid consumption averages around 1250
> > xid/sec -- this is when we see the "aggressive" autovac kick off in
> > the logs. What I don't understand is why these xid's are being
> > consumed at this rate on the databases with no activity (postgres,
> > template1).
>
> The xid counter is global across all databases.

Then what does the "age" value represent for each database in this case?
Perhaps I'm misunderstanding what I'm looking at?

postgres=# SELECT datname, age(datfrozenxid),
           current_setting('autovacuum_freeze_max_age') FROM pg_database;
  datname  |    age    | current_setting
-----------+-----------+-----------------
 postgres  | 100937449 | 200000000
 template1 |  50244438 | 200000000
 template0 | 160207297 | 200000000
 feedi     | 150147602 | 200000000

> > > What makes you think it is a problem that you have all these vacuums? If
> > > you actually update that much, and you have indexes, you're going to want
> > > a lot of vacuums?
> >
> > I actually don't think there's a problem with the vacuums (I was
> > mostly pointing out that they are very regular and not problematic).
> > The main problem I am having is that something is causing the
> > autovacuums to completely stop and require manual intervention to
> > resume -- and it seems to be when the "postgres" or "template1"
> > database hits the autovacuum_freeze_max_age.
>
> Did you look at pg_stat_activity for those autovacuums to see whether
> they're blocked on something?

This is not something we've done yet but will next time it occurs.

> > > > What is interesting is that this happens with the 'postgres' and
> > > > 'template1' databases as well and there is absolutely no activity in
> > > > those databases.
> > >
> > > That's normal. They should be pretty darn quick in v12?
> >
> > Yes, a manual VACUUM FREEZE of either database takes less than 1
> > second -- which is why it's perplexing that the autovac starts and
> > never seems to complete and prevents other autovacs from running.
>
> One big difference between a manual VACUUM and autovacuum is that with
> the default settings VACUUM is not throttled, but autovacuum is.
>
> What are your vacuum_cost_delay, autovacuum_vacuum_cost_delay,
> vacuum_cost_limit, autovacuum_vacuum_cost_limit, vacuum_cost_page_hit,
> vacuum_cost_page_miss set to?
Here are all the vacuum related values for the server:

postgres=# select name,setting from pg_settings where name like '%vacuum%';

autovacuum = on
autovacuum_analyze_scale_factor = 0.1
autovacuum_analyze_threshold = 2500
autovacuum_freeze_max_age = 200000000
autovacuum_max_workers = 8
autovacuum_multixact_freeze_max_age = 400000000
autovacuum_naptime = 15
autovacuum_vacuum_cost_delay = 20
autovacuum_vacuum_cost_limit = -1
autovacuum_vacuum_scale_factor = 0.2
autovacuum_vacuum_threshold = 500
autovacuum_work_mem = -1
log_autovacuum_min_duration = 0
vacuum_cleanup_index_scale_factor = 0.1
vacuum_cost_delay = 0
vacuum_cost_limit = 1000
vacuum_cost_page_dirty = 20
vacuum_cost_page_hit = 1
vacuum_cost_page_miss = 10
vacuum_defer_cleanup_age = 0
vacuum_freeze_min_age = 50000000
vacuum_freeze_table_age = 150000000
vacuum_multixact_freeze_min_age = 5000000
vacuum_multixact_freeze_table_age = 150000000

I know the database is busy, so the throttling makes sense, but it seems
like it would complete eventually. We see blocked autovacs for many hours.

> Greetings,
>
> Andres Freund
Hi,

On 2020-03-19 18:07:14 -0500, Justin King wrote:
> On Thu, Mar 19, 2020 at 5:35 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2020-03-19 10:23:48 -0500, Justin King wrote:
> > > > From a single stats snapshot we can't actually understand the actual xid
> > > > consumption - is it actually the xid usage that triggers the vacuums?
> > >
> > > We have looked at this and the xid consumption averages around 1250
> > > xid/sec -- this is when we see the "aggressive" autovac kick off in
> > > the logs. What I don't understand is why these xid's are being
> > > consumed at this rate on the databases with no activity (postgres,
> > > template1).
> >
> > The xid counter is global across all databases.
>
> Then what does the "age" value represent for each database in this
> case? Perhaps I'm misunderstanding what I'm looking at?
>
> postgres=# SELECT datname, age(datfrozenxid),
>            current_setting('autovacuum_freeze_max_age') FROM pg_database;
>   datname  |    age    | current_setting
> -----------+-----------+-----------------
>  postgres  | 100937449 | 200000000
>  template1 |  50244438 | 200000000
>  template0 | 160207297 | 200000000
>  feedi     | 150147602 | 200000000

Look at datfrozenxid without the age(). age(xid) computes how "old" xid
is compared to the "next" xid to be assigned. Until vacuum comes around
and performs work, pg_database.datfrozenxid / pg_class.relfrozenxid are
constant, since they represent the values actually present in the
table. But if xids are being consumed, their "age" increases, because
they're further and further in the past relative to the "newest" xids.

> > One big difference between a manual VACUUM and autovacuum is that with
> > the default settings VACUUM is not throttled, but autovacuum is.
> >
> > What are your vacuum_cost_delay, autovacuum_vacuum_cost_delay,
> > vacuum_cost_limit, autovacuum_vacuum_cost_limit, vacuum_cost_page_hit,
> > vacuum_cost_page_miss set to?
>
> Here are all the vacuum related values for the server:
>
> postgres=# select name,setting from pg_settings where name like '%vacuum%';
>
> autovacuum = on
> autovacuum_analyze_scale_factor = 0.1
> autovacuum_analyze_threshold = 2500
> autovacuum_freeze_max_age = 200000000
> autovacuum_max_workers = 8
> autovacuum_multixact_freeze_max_age = 400000000
> autovacuum_naptime = 15
> autovacuum_vacuum_cost_delay = 20
> autovacuum_vacuum_cost_limit = -1
> autovacuum_vacuum_scale_factor = 0.2
> autovacuum_vacuum_threshold = 500
> autovacuum_work_mem = -1
> log_autovacuum_min_duration = 0
> vacuum_cleanup_index_scale_factor = 0.1
> vacuum_cost_delay = 0
> vacuum_cost_limit = 1000
> vacuum_cost_page_dirty = 20
> vacuum_cost_page_hit = 1
> vacuum_cost_page_miss = 10
> vacuum_defer_cleanup_age = 0
> vacuum_freeze_min_age = 50000000
> vacuum_freeze_table_age = 150000000
> vacuum_multixact_freeze_min_age = 5000000
> vacuum_multixact_freeze_table_age = 150000000
>
> I know the database is busy, so the throttling makes sense, but it
> seems like it would complete eventually.

The cost limit/delay are way too long/small respectively for a busy
postgres instance.

> We see blocked autovacs for many hours.

On the same table, or just generally being busy?

Greetings,

Andres Freund
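(A sketch of how the throttling mentioned above could be loosened; the
values are only illustrative and not a recommendation made in the thread:)

    ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '2ms';
    ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;
    SELECT pg_reload_conf();   -- no restart needed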
On Thu, Mar 19, 2020 at 6:56 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2020-03-19 18:07:14 -0500, Justin King wrote: > > On Thu, Mar 19, 2020 at 5:35 PM Andres Freund <andres@anarazel.de> wrote: > > > > > > Hi, > > > > > > On 2020-03-19 10:23:48 -0500, Justin King wrote: > > > > > From a single stats snapshot we can't actually understand the actual xid > > > > > consumption - is it actually the xid usage that triggers the vacuums? > > > > > > > > We have looked at this and the xid consumption averages around 1250 > > > > xid/sec -- this is when we see the "aggressive" autovac kick off in > > > > the logs. What I don't understand is why these xid's are being > > > > consumed at this rate on the databases with no activity (postgres, > > > > template1). > > > > > > The xid counter is global across all databases. > > > > Then what does the "age" value represent for each database in this > > case? Perhaps I'm misunderstanding what I'm looking at? > > > > postgres=# SELECT datname, age(datfrozenxid), > > current_setting('autovacuum_freeze_max_age') FROM pg_database; > > datname | age | current_setting > > -----------+-----------+----------------- > > postgres | 100937449 | 200000000 > > template1 | 50244438 | 200000000 > > template0 | 160207297 | 200000000 > > feedi | 150147602 | 200000000 > > Look at datfrozenxid without the age(). age(xid) computes how "old" xid > is compared to the "next" xid to be assigned. Until vacuum comes around > and performs work, pg_database.datfrozenxid / pg_class.relfrozenxid are > constant, since they represent the values actually present in the > table. But if xids are being consumed, their "age" increases, because > they're further and further in the past relative to the "newest" xids. > > > > > One big difference between a manual VACUUM and autovacuum is that with > > > the default settings VACUUM is not throttled, but autovacuum is. > > > > > > What are your vacuum_cost_delay, autovacuum_vacuum_cost_delay, > > > vacuum_cost_limit, autovacuum_vacuum_cost_limit, vacuum_cost_page_hit, > > > vacuum_cost_page_miss set to? > > > > Here are all the vacuum related values for the server: > > > > postgres=# select name,setting from pg_settings where name like '%vacuum%'; > > autovacuum = on > > autovacuum_analyze_scale_factor = 0.1 > > autovacuum_analyze_threshold = 2500 > > autovacuum_freeze_max_age = 200000000 > > autovacuum_max_workers = 8 > > autovacuum_multixact_freeze_max_age = 400000000 > > autovacuum_naptime = 15 > > autovacuum_vacuum_cost_delay = 20 > > autovacuum_vacuum_cost_limit = -1 > > autovacuum_vacuum_scale_factor = 0.2 > > autovacuum_vacuum_threshold = 500 > > autovacuum_work_mem = -1 > > log_autovacuum_min_duration = 0 > > vacuum_cleanup_index_scale_factor = 0.1 > > vacuum_cost_delay = 0 > > vacuum_cost_limit = 1000 > > vacuum_cost_page_dirty = 20 > > vacuum_cost_page_hit = 1 > > vacuum_cost_page_miss = 10 > > vacuum_defer_cleanup_age = 0 > > vacuum_freeze_min_age = 50000000 > > vacuum_freeze_table_age = 150000000 > > vacuum_multixact_freeze_min_age = 5000000 > > vacuum_multixact_freeze_table_age = 150000000 > > > > I know the database is busy, so the throttling makes sense, but it > > seems like it would complete eventually. > > The cost limit/delay are way too long/small respectively for a busy > postgres instance. This does make sense, we will look into adjusting those values. > > We see blocked autovacs for many hours. > > On the same table, or just generally being busy? 
We haven't isolated *which* table it is blocked on (assuming it is), but
all autovac's cease running until we manually intervene.

When we get into this state again, is there some other information (other
than what is in pg_stat_statement or pg_stat_activity) that would be
useful for folks here to help understand what is going on?

> Greetings,
>
> Andres Freund
Hi,

On 2020-03-20 12:42:31 -0500, Justin King wrote:
> When we get into this state again, is there some other information
> (other than what is in pg_stat_statement or pg_stat_activity) that
> would be useful for folks here to help understand what is going on?

If it's actually stuck on a single table, and that table is not large,
it would be useful to get a backtrace with gdb.

Greetings,

Andres Freund
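(To attach gdb as suggested, the worker's PID has to be found first; a
sketch assuming PostgreSQL 12:)

    -- the resulting pid can then be attached to with "gdb -p <pid>" on the server
    SELECT pid, datname, state, query
    FROM pg_stat_activity
    WHERE backend_type = 'autovacuum worker';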
On Fri, Mar 20, 2020 at 12:03:17PM -0700, Andres Freund wrote:
> Hi,
>
> On 2020-03-20 12:42:31 -0500, Justin King wrote:
> > When we get into this state again, is there some other information
> > (other than what is in pg_stat_statement or pg_stat_activity) that
> > would be useful for folks here to help understand what is going on?
>
> If it's actually stuck on a single table, and that table is not large,
> it would be useful to get a backtrace with gdb.

FTR, we're facing a very similar issue at work (adding Michael and Kevin
in Cc) during performance tests since a recent upgrade to pg12.

What seems to be happening is that after reaching 200M transactions a
first pass of autovacuum freeze is being run, bumping
pg_database.datfrozenxid by ~ 800k (age(datfrozenxid) still being more
than autovacuum_freeze_max_age afterwards).

After that point, all available information seems to indicate that no
autovacuum worker is scheduled anymore:

- log_autovacuum_min_duration is set to 0 and no activity is logged (while
  having thousands of those per hour before that)
- 15 min interval snapshot of pg_database and pg_class shows that
  datfrozenxid/relfrozenxid keeps increasing at a consistent rate and
  never goes down
- 15 min interval snapshot of pg_stat_activity doesn't show any autovacuum
  worker
- the autovacuum launcher is up and running and doesn't show any sign of
  problem
- n_mod_since_analyze keeps growing at a consistent rate, never going down
- 15 min delta of tup_updated and tup_deleted shows that the global write
  activity doesn't change before and after the autovacuum problem

The situation continues for ~2h, at which point the bloat is so heavy that
the main filesystem becomes full, and postgres panics after a failed write
in the pg_logical directory or similar.

The same bench was run against pg11 many times and never triggered this
issue. So far our best guess is a side effect of 2aa6e331ead7.

Michael and I have been trying to reproduce this issue locally
(drastically reducing the various freeze_age parameters) for hours, but no
luck for now.

This is using a vanilla pg 12.1, with some OLTP workload. The only
possibly relevant configuration changes are quite aggressive autovacuum
settings on some tables (no delay, analyze/vacuum threshold to 1k and
analyze/vacuum scale factor to 0, for both heap and toast).
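(The kind of periodic snapshot described above might look like this -- an
assumed sketch, not the reporters' exact queries:)

    -- per-database frozen xid, sampled every 15 minutes
    SELECT now(), datname, datfrozenxid, age(datfrozenxid) FROM pg_database;

    -- per-table frozen xid and pending modifications
    SELECT now(), s.relname, c.relfrozenxid, age(c.relfrozenxid), s.n_mod_since_analyze
    FROM pg_stat_user_tables s
    JOIN pg_class c ON c.oid = s.relid
    ORDER BY age(c.relfrozenxid) DESC
    LIMIT 10;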
Hi,

On 2020-03-23 16:22:47 +0100, Julien Rouhaud wrote:
> On Fri, Mar 20, 2020 at 12:03:17PM -0700, Andres Freund wrote:
> > Hi,
> >
> > On 2020-03-20 12:42:31 -0500, Justin King wrote:
> > > When we get into this state again, is there some other information
> > > (other than what is in pg_stat_statement or pg_stat_activity) that
> > > would be useful for folks here to help understand what is going on?
> >
> > If it's actually stuck on a single table, and that table is not large,
> > it would be useful to get a backtrace with gdb.
>
> FTR, we're facing a very similar issue at work (adding Michael and Kevin
> in Cc) during performance tests since a recent upgrade to pg12.
>
> What seems to be happening is that after reaching 200M transactions a
> first pass of autovacuum freeze is being run, bumping
> pg_database.datfrozenxid by ~ 800k (age(datfrozenxid) still being more
> than autovacuum_freeze_max_age afterwards).

If you have older transactions around that'd not be surprising. Do you
have autovacuum logging output for this case?

> After that point, all available information seems to indicate that no
> autovacuum worker is scheduled anymore:

Do you mean that this table doesn't get autovac'ed at all anymore, that
no table is getting autovac'd, or just that there's nothing further
increasing relfrozenxid for that table?

It sounds like:

> - log_autovacuum_min_duration is set to 0 and no activity is logged (while
>   having thousands of those per hour before that)

no table at all?

> - 15 min interval snapshot of pg_database and pg_class shows that
>   datfrozenxid/relfrozenxid keeps increasing at a consistent rate and
>   never goes down

I assume you mean their age? What is:

- datfrozenxid, age(datfrozenxid) for the oldest database
  SELECT datname, age(datfrozenxid), datfrozenxid FROM pg_database
  ORDER BY age(datfrozenxid) DESC LIMIT 1;
- relfrozenxid, age(relfrozenxid) for the oldest table in the oldest database
  SELECT oid::regclass, age(relfrozenxid), relfrozenxid FROM pg_class
  WHERE relfrozenxid <> 0 ORDER BY age(relfrozenxid) DESC LIMIT 1;
- Oldest backend xmin
  SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_activity
  WHERE backend_xmin <> 0 ORDER BY age(backend_xmin) DESC LIMIT 3;
- oldest replication xmin:
  SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_replication
  WHERE backend_xmin <> 0 ORDER BY age(backend_xmin) DESC LIMIT 3;
- oldest slot xmin:
  SELECT slot_name, xmin, age(xmin), catalog_xmin, age(catalog_xmin)
  FROM pg_replication_slots WHERE xmin <> 0 OR catalog_xmin <> 0
  ORDER BY greatest(age(xmin), age(catalog_xmin)) DESC LIMIT 3;
- oldest prepared transaction (unfortunately xmin not available)
  SELECT gid, database, transaction FROM pg_prepared_xacts
  ORDER BY age(transaction) LIMIT 3;

> The same bench was run against pg11 many times and never triggered this
> issue. So far our best guess is a side effect of 2aa6e331ead7.

In that case you'd likely see DEBUG1 output, right? Did you try running
with that?

> Michael and I have been trying to reproduce this issue locally
> (drastically reducing the various freeze_age parameters) for hours, but
> no luck for now.

Have you considered using gdb to investigate?

> This is using a vanilla pg 12.1, with some OLTP workload. The only
> possibly relevant configuration changes are quite aggressive autovacuum
> settings on some tables (no delay, analyze/vacuum threshold to 1k and
> analyze/vacuum scale factor to 0, for both heap and toast).

That, uh, is not an insignificant set of changes for an autovac
scheduling issue.
It sounds like what might be happening is that you have something holding
back the "oldest needed transaction" horizon. Before 2aa6e331ead7, if the
xmin horizon hasn't increased, there'd be repeated vacuums unable to
increase the horizon, but afterwards they'll all immediately exit without
any messages.

I wonder if what might be happening is that we somehow missed/failed to
update relfrozenxid and/or datfrozenxid. If you manually vacuum some table
in the oldest database, but that is *NOT* the oldest table itself, does
the problem "resolve" itself?

Greetings,

Andres Freund
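(A sketch of the experiment suggested above: find the oldest tables, then
vacuum a *different* one and re-check datfrozenxid; the table name in the
VACUUM is hypothetical:)

    -- tables with the oldest relfrozenxid in the current database
    SELECT oid::regclass, age(relfrozenxid)
    FROM pg_class
    WHERE relkind IN ('r', 't', 'm')
    ORDER BY age(relfrozenxid) DESC
    LIMIT 5;

    -- then vacuum some other, not-oldest table
    VACUUM (VERBOSE) some_other_table;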
Hi,

On Mon, Mar 23, 2020 at 09:23:03AM -0700, Andres Freund wrote:
> Hi,
>
> On 2020-03-23 16:22:47 +0100, Julien Rouhaud wrote:
> > On Fri, Mar 20, 2020 at 12:03:17PM -0700, Andres Freund wrote:
> > > Hi,
> > >
> > > On 2020-03-20 12:42:31 -0500, Justin King wrote:
> > > > When we get into this state again, is there some other information
> > > > (other than what is in pg_stat_statement or pg_stat_activity) that
> > > > would be useful for folks here to help understand what is going on?
> > >
> > > If it's actually stuck on a single table, and that table is not large,
> > > it would be useful to get a backtrace with gdb.
> >
> > FTR, we're facing a very similar issue at work (adding Michael and
> > Kevin in Cc) during performance tests since a recent upgrade to pg12.
> >
> > What seems to be happening is that after reaching 200M transactions a
> > first pass of autovacuum freeze is being run, bumping
> > pg_database.datfrozenxid by ~ 800k (age(datfrozenxid) still being more
> > than autovacuum_freeze_max_age afterwards).
>
> If you have older transactions around that'd not be surprising. Do you
> have autovacuum logging output for this case?

There's an hourly long running query that can retain xmin up to a few
million xid, but definitely not something close to 200M. When I said a
first pass, I meant a batch of vacuums.

Here's the overview of "to prevent wraparound" per-hour logs (UTC timezone):

    442 2020-03-14 07
   1686 2020-03-14 08
     14 2020-03-14 10

and an overview of all autovacuum activity (still in UTC):

   1366 2020-03-14 00
   1457 2020-03-14 01
   1387 2020-03-14 02
   1440 2020-03-14 03
   1349 2020-03-14 04
   7383 2020-03-14 05
  13909 2020-03-14 06
  14240 2020-03-14 07
   2094 2020-03-14 08
      0 2020-03-14 09
     16 2020-03-14 10
      3 2020-03-14 11
      4 2020-03-14 12
      3 2020-03-14 13

The final outage being:

2020-03-14 10:27:23.280 UTC [...] ERROR: could not extend file "base/16386/20245.4": No space left on device
536222:2020-03-14 10:40:00.089 UTC [...] PANIC: could not write to file "pg_logical/replorigin_checkpoint.tmp": No space left on device

the last autovacuum evidence before that being:

274177:2020-03-14 08:54:11.797 UTC 5e6c8ed0.d665 0 LOG: automatic vacuum of table "pg_toast.pg_toast_20237": index scans: 1
274178- pages: 0 removed, 273666 remain, 0 skipped due to pins, 251545 skipped frozen
274179- tuples: 83585 removed, 749 remain, 209 are dead but not yet removable, oldest xmin: 210363848
274180- buffer usage: 50096 hits, 23521 misses, 19996 dirtied
274181- avg read rate: 99.195 MB/s, avg write rate: 84.329 MB/s
274182- system usage: CPU: user: 0.28 s, system: 0.21 s, elapsed: 1.85 s

After the crash-and-restart autovacuum is working again (as seen in the
10AM - 1PM logs), although the bench stays stopped.

What I can see in pg_database is (GMT+1 here):

           timestamp           | datfrozenxid | age(datfrozenxid)
[...]
 2020-03-14 09:12:11.279515+01 |          480 |         202554804
 2020-03-14 09:27:12.723617+01 |          480 |         205408276
 2020-03-14 09:42:13.868701+01 |          480 |         208239967
 2020-03-14 09:57:15.685827+01 |       827585 |         210143294
 2020-03-14 10:12:17.488993+01 |       827585 |         213143797
 2020-03-14 10:27:18.899525+01 |       827585 |         216104516
 2020-03-14 10:42:19.926601+01 |       827585 |         219075040
 2020-03-14 10:57:21.023513+01 |       827585 |         222085423
 2020-03-14 11:12:22.85198+01  |       827585 |         225057731

> > After that point, all available information seems to indicate that no
> > autovacuum worker is scheduled anymore:
>
> Do you mean that this table doesn't get autovac'ed at all anymore, that
> no table is getting autovac'd, or just that there's nothing further
> increasing relfrozenxid for that table?
>
> It sounds like:
>
> > - log_autovacuum_min_duration is set to 0 and no activity is logged (while
> >   having thousands of those per hour before that)
>
> no table at all?

Correct, no table being autovacuumed, no sign of autovacuum being
scheduled or anything.

> > - 15 min interval snapshot of pg_database and pg_class shows that
> >   datfrozenxid/relfrozenxid keeps increasing at a consistent rate and
> >   never goes down
>
> I assume you mean their age?

Yes sorry.

> What is:
> - datfrozenxid, age(datfrozenxid) for the oldest database
>   SELECT datname, age(datfrozenxid), datfrozenxid FROM pg_database
>   ORDER BY age(datfrozenxid) DESC LIMIT 1;

see above. FTR there's only one database being used.

> - relfrozenxid, age(relfrozenxid) for the oldest table in the oldest database
>   SELECT oid::regclass, age(relfrozenxid), relfrozenxid FROM pg_class
>   WHERE relfrozenxid <> 0 ORDER BY age(relfrozenxid) DESC LIMIT 1;

The vm has been trashed since, and I don't have that level of detail
available in the gathered stats unfortunately (and the available
information I have is a little bit messy, sorry for that).

> - Oldest backend xmin
>   SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_activity
>   WHERE backend_xmin <> 0 ORDER BY age(backend_xmin) DESC LIMIT 3;
> - oldest replication xmin:
>   SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_replication
>   WHERE backend_xmin <> 0 ORDER BY age(backend_xmin) DESC LIMIT 3;
> - oldest slot xmin:
>   SELECT slot_name, xmin, age(xmin), catalog_xmin, age(catalog_xmin)
>   FROM pg_replication_slots WHERE xmin <> 0 OR catalog_xmin <> 0
>   ORDER BY greatest(age(xmin), age(catalog_xmin)) DESC LIMIT 3;
> - oldest prepared transaction (unfortunately xmin not available)
>   SELECT gid, database, transaction FROM pg_prepared_xacts
>   ORDER BY age(transaction) LIMIT 3;

I have output from this query:
https://github.com/OPMDG/check_pgactivity/blob/oldest_xmin/check_pgactivity#L5669-L5732
(which may be dumb).
Here are the non-null rows after 8AM GMT+1:

           timestamp           |    kind     |   age   |   x_min
-------------------------------+-------------+---------+-----------
 2020-03-14 08:12:04.082523+01 | active_xact |     968 | 191022091
 2020-03-14 08:12:04.082523+01 | idle_xact   |     968 | 191022091
 2020-03-14 08:12:04.082523+01 | query       | 2538322 | 188484737
 2020-03-14 08:27:07.046938+01 | active_xact |     606 | 193917749
 2020-03-14 08:27:07.046938+01 | idle_xact   |     606 | 193917749
 2020-03-14 08:27:07.046938+01 | query       |   39646 | 193878709
 2020-03-14 08:42:08.902248+01 | active_xact |     502 | 196764934
 2020-03-14 08:42:08.902248+01 | idle_xact   |     502 | 196764934
 2020-03-14 08:42:08.902248+01 | query       |   12787 | 196752649
 2020-03-14 08:57:10.089606+01 | active_xact |      91 | 199684930
 2020-03-14 08:57:10.089606+01 | idle_xact   |      91 | 199684930
 2020-03-14 08:57:10.089606+01 | query       |   25453 | 199659568
 2020-03-14 09:12:11.279515+01 | active_xact |     264 | 202555020
 2020-03-14 09:12:11.279515+01 | idle_xact   |     264 | 202555020
 2020-03-14 09:12:11.279515+01 | query       | 2316084 | 200239200
 2020-03-14 09:27:12.723617+01 | active_xact |      90 | 205408666
 2020-03-14 09:27:12.723617+01 | idle_xact   |      90 | 205408666
 2020-03-14 09:27:12.723617+01 | query       | 5169556 | 200239200
 2020-03-14 09:42:13.868701+01 | active_xact |     381 | 208240066
 2020-03-14 09:42:13.868701+01 | idle_xact   |     381 | 208240066
 2020-03-14 09:42:13.868701+01 | query       |  934819 | 207305628
 2020-03-14 09:57:15.685827+01 | active_xact |     791 | 210970088
 2020-03-14 09:57:15.685827+01 | idle_xact   |     791 | 210970088
 2020-03-14 09:57:15.685827+01 | query       |     791 | 210970088
 2020-03-14 10:12:17.488993+01 | active_xact |     243 | 213971139
 2020-03-14 10:12:17.488993+01 | idle_xact   |      61 | 213971321
 2020-03-14 10:27:18.899525+01 | active_xact |     105 | 216931996
 2020-03-14 10:27:18.899525+01 | idle_xact   |     105 | 216931996
 2020-03-14 10:27:18.899525+01 | query       |     109 | 216931992
 2020-03-14 10:42:19.926601+01 | active_xact | 1046073 | 218856552
 2020-03-14 10:42:19.926601+01 | idle_xact   |      14 | 219902611
 2020-03-14 10:42:19.926601+01 | query       |     234 | 219902391
 2020-03-14 10:57:21.023513+01 | active_xact |     707 | 222912301
 2020-03-14 10:57:21.023513+01 | idle_xact   |     707 | 222912301
 2020-03-14 10:57:21.023513+01 | query       |     707 | 222912301
 2020-03-14 11:12:22.85198+01  | active_xact |     302 | 225885014
 2020-03-14 11:12:22.85198+01  | idle_xact   |     302 | 225885014
 2020-03-14 11:12:22.85198+01  | query       |     302 | 225885014

> > The same bench was run against pg11 many times and never triggered this
> > issue. So far our best guess is a side effect of 2aa6e331ead7.
>
> In that case you'd likely see DEBUG1 output, right? Did you try running
> with that?

That's unfortunately not an option, as while the issue is reproducible, it
happens after roughly 16h with an average of 6.5ktps, and we definitely
don't have enough disk space for that amount of logs. We'll try next time
with a recompiled version with the DEBUG1 promoted to WARNING, and maybe
other elog.

> > Michael and I have been trying to reproduce this issue locally
> > (drastically reducing the various freeze_age parameters) for hours, but
> > no luck for now.
>
> Have you considered using gdb to investigate?

Not for now, because it's hard to be around for the ~ 90min interval the
system stays alive after autovacuum gets stuck, especially since it's a
weekend bench.

> > This is using a vanilla pg 12.1, with some OLTP workload. The only
> > possibly relevant configuration changes are quite aggressive autovacuum
> > settings on some tables (no delay, analyze/vacuum threshold to 1k and
> > analyze/vacuum scale factor to 0, for both heap and toast).
>
> That, uh, is not an insignificant set of changes for an autovac
> scheduling issue.
>
> It sounds like what might be happening is that you have something
> holding back the "oldest needed transaction" horizon. Before
> 2aa6e331ead7, if the xmin horizon hasn't increased, there'd be repeated
> vacuums unable to increase the horizon, but afterwards they'll all
> immediately exit without any messages.

The xmin horizon is increasing AFAICS. Grepping the autovacuum logs, I can
see for the 08:00 AM UTC - 08:59 AM UTC interval the oldest xmin going
from 200069684 to 210363848. E.g.:

[...]
266603:2020-03-14 08:49:59.927 UTC LOG: automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_27683": index scans: 0
266604- pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
266605- tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635921
266606- buffer usage: 25 hits, 1 misses, 1 dirtied
266607- avg read rate: 15.440 MB/s, avg write rate: 15.440 MB/s
266608- system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
266609:2020-03-14 08:49:59.929 UTC LOG: automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_27694": index scans: 0
266610- pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
266611- tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635934
266612- buffer usage: 25 hits, 1 misses, 1 dirtied
266613- avg read rate: 21.058 MB/s, avg write rate: 21.058 MB/s
266614- system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
266615:2020-03-14 08:49:59.931 UTC LOG: automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_24845": index scans: 0
266616- pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
266617- tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635934
266618- buffer usage: 25 hits, 1 misses, 2 dirtied
266619- avg read rate: 8.005 MB/s, avg write rate: 16.009 MB/s
266620- system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
266621:2020-03-14 08:49:59.945 UTC LOG: automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_27712": index scans: 0
266622- pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
266623- tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635953
[...]

> I wonder if what might be happening is that we somehow missed/failed
> to update relfrozenxid and/or datfrozenxid. If you manually vacuum some
> table in the oldest database, but that is *NOT* the oldest table itself,
> does the problem "resolve" itself?

I'll also add pg_class snapshot for next time we run the bench. I'm not
sure if we'll be able to catch the 1h45 interval when the system stays
alive after the issue though.
Hi, On 2020-03-23 20:47:25 +0100, Julien Rouhaud wrote: > > - relfrozenxid, age(relfrozenxid) for the oldest table in the oldest database > > SELECT oid::regclass, age(relfrozenxid), relfrozenxid FROM pg_class WHERE relfrozenxid <> 0 ORDER BY age(relfrozenxid)DESC LIMIT 1; > > The vm has been trashed since, and I don't have that level of detail available > in the gathered stats unfortunately (and the available information I have is a > little bit messy, sorry for that). > > > - Oldest backend xmin > > SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_activity WHERE backend_xmin <> 0 ORDER BY age(backend_xmin)DESC LIMIT 3; > > - oldest replication xmin: > > SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_replication WHERE backend_xmin <> 0 ORDER BY age(backend_xmin)DESC LIMIT 3; > > - oldest slot xmin: > > SELECT slot_name, xmin, age(xmin), catalog_xmin, age(catalog_xmin) FROM pg_replication_slots WHERE xmin <> 0 OR catalog_xmin<> 0 ORDER BY greatest(age(xmin), age(catalog_xmin)) DESC LIMIT 3; > > - oldest prepared transaction (unfortunately xmin not available) > > SELECT gid, database, transaction FROM pg_prepared_xacts ORDER BY age(transaction) LIMIT 3; > > I have output from this query: https://github.com/OPMDG/check_pgactivity/blob/oldest_xmin/check_pgactivity#L5669-L5732 > (which may be dumb). Here are the non-null rows after 8AM GMT+1: Could you share what the config of the server was? > > > The same bench was run against pg11 many times and never triggered this issue. > > > So far our best guess is a side effect of 2aa6e331ead7. > > > > In that case you'd likely see DEBUG1 output, right? Did you try running > > with that? > > That's unfortunately not an option, as while the issue is reproducible, it > happens after roughly 16h with an average of 6.5ktps, and we definitely don't > have enough disk space for that amount of logs. We'll try next time with a > recompiled version with the DEBUG1 promoted to WARNING, and maybe other elog. You wouldn't need to run it with DEBUG1 the whole time, you could just change the config and reload config once you hit the issue. > > > Michael and I have been trying to reproduce this issue locally (drastically > > > reducing the various freeze_age parameters) for hours, but no luck for now. > > > > Have you considered using gdb to investigate? > > Not for now, because it's hard to be around for the ~ 90min interval the system > stays alive after atuvacuum get stuck, especially since it's a weekend bench. > > > > This is using a vanilla pg 12.1, with some OLTP workload. The only possibly > > > relevant configuration changes are quite aggressive autovacuum settings on some > > > tables (no delay, analyze/vacuum threshold to 1k and analyze/vacuum scale > > > factor to 0, for both heap and toast). > > > > That, uh, is not an insignificant set of changes for an autovac > > scheduling issues. > > > > It sounds like what might be happening is that you have something > > holding back the "oldest needed transaction" horizon. Before > > 2aa6e331ead7, if the xmin horizon hasn't increased, there'd be repeated > > vacuums unable to increase the horizon, but afterwards they'll all > > immediately exit without any messages. > > The xmin horizon is increasing AFAICS. Grepping the autovacuum logs, I can see > for the 08:00 AM UTC - 08:59 AM UTC interval the oldest xmin going from > 200069684 to 210363848. E.g.: Which database is this on? > [...] 
> 266603:2020-03-14 08:49:59.927 UTC LOG: automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_27683":index scans: 0 > 266604- pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen > 266605- tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635921 > 266606- buffer usage: 25 hits, 1 misses, 1 dirtied > 266607- avg read rate: 15.440 MB/s, avg write rate: 15.440 MB/s > 266608- system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s > 266609:2020-03-14 08:49:59.929 UTC LOG: automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_27694":index scans: 0 > 266610- pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen > 266611- tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635934 > 266612- buffer usage: 25 hits, 1 misses, 1 dirtied > 266613- avg read rate: 21.058 MB/s, avg write rate: 21.058 MB/s > 266614- system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s > 266615:2020-03-14 08:49:59.931 UTC LOG: automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_24845":index scans: 0 > 266616- pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen > 266617- tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635934 > 266618- buffer usage: 25 hits, 1 misses, 2 dirtied > 266619- avg read rate: 8.005 MB/s, avg write rate: 16.009 MB/s > 266620- system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s > 266621:2020-03-14 08:49:59.945 UTC LOG: automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_27712":index scans: 0 > 266622- pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen > 266623- tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635953 > [...] Do you have any non-toast ones? > > I wonder if what might be happening is that we're somehow missed/failed > > to update relfrozenxid and/or datfrozenxid. If you manually vacuum some > > table in the oldest database, but that is *NOT* the oldest table itself, > > does the problem "resolve" itself? > > I'll also add pg_class snapshot for next time we run the bench. I'm not sure > if we'll be able to catch the 1h45 interval when the system stays alive > after the issue though. Could you just script something to stop the benchmark once the disk is 90% full or so? Did you see any errors / fatals around the time autovacuum stopped working? Greetings, Andres Freund
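(One way the "change the config and reload" suggestion above could be
applied only once the problem appears, without a restart -- a sketch:)

    ALTER SYSTEM SET log_min_messages = 'debug1';
    SELECT pg_reload_conf();
    -- and to revert afterwards:
    ALTER SYSTEM RESET log_min_messages;
    SELECT pg_reload_conf();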
On Mon, Mar 23, 2020 at 3:00 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2020-03-23 20:47:25 +0100, Julien Rouhaud wrote: > > > - relfrozenxid, age(relfrozenxid) for the oldest table in the oldest database > > > SELECT oid::regclass, age(relfrozenxid), relfrozenxid FROM pg_class WHERE relfrozenxid <> 0 ORDER BY age(relfrozenxid)DESC LIMIT 1; > > > > The vm has been trashed since, and I don't have that level of detail available > > in the gathered stats unfortunately (and the available information I have is a > > little bit messy, sorry for that). > > > > > - Oldest backend xmin > > > SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_activity WHERE backend_xmin <> 0 ORDER BY age(backend_xmin)DESC LIMIT 3; > > > - oldest replication xmin: > > > SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_replication WHERE backend_xmin <> 0 ORDER BY age(backend_xmin)DESC LIMIT 3; > > > - oldest slot xmin: > > > SELECT slot_name, xmin, age(xmin), catalog_xmin, age(catalog_xmin) FROM pg_replication_slots WHERE xmin <> 0 OR catalog_xmin<> 0 ORDER BY greatest(age(xmin), age(catalog_xmin)) DESC LIMIT 3; > > > - oldest prepared transaction (unfortunately xmin not available) > > > SELECT gid, database, transaction FROM pg_prepared_xacts ORDER BY age(transaction) LIMIT 3; > > > > I have output from this query: https://github.com/OPMDG/check_pgactivity/blob/oldest_xmin/check_pgactivity#L5669-L5732 > > (which may be dumb). Here are the non-null rows after 8AM GMT+1: > > Could you share what the config of the server was? > > > > > > The same bench was run against pg11 many times and never triggered this issue. > > > > So far our best guess is a side effect of 2aa6e331ead7. > > > > > > In that case you'd likely see DEBUG1 output, right? Did you try running > > > with that? > > > > That's unfortunately not an option, as while the issue is reproducible, it > > happens after roughly 16h with an average of 6.5ktps, and we definitely don't > > have enough disk space for that amount of logs. We'll try next time with a > > recompiled version with the DEBUG1 promoted to WARNING, and maybe other elog. > > You wouldn't need to run it with DEBUG1 the whole time, you could just > change the config and reload config once you hit the issue. > > > > > > Michael and I have been trying to reproduce this issue locally (drastically > > > > reducing the various freeze_age parameters) for hours, but no luck for now. > > > > > > Have you considered using gdb to investigate? > > > > Not for now, because it's hard to be around for the ~ 90min interval the system > > stays alive after atuvacuum get stuck, especially since it's a weekend bench. > > > > > > This is using a vanilla pg 12.1, with some OLTP workload. The only possibly > > > > relevant configuration changes are quite aggressive autovacuum settings on some > > > > tables (no delay, analyze/vacuum threshold to 1k and analyze/vacuum scale > > > > factor to 0, for both heap and toast). > > > > > > That, uh, is not an insignificant set of changes for an autovac > > > scheduling issues. > > > > > > It sounds like what might be happening is that you have something > > > holding back the "oldest needed transaction" horizon. Before > > > 2aa6e331ead7, if the xmin horizon hasn't increased, there'd be repeated > > > vacuums unable to increase the horizon, but afterwards they'll all > > > immediately exit without any messages. > > > > The xmin horizon is increasing AFAICS. 
Grepping the autovacuum logs, I can see > > for the 08:00 AM UTC - 08:59 AM UTC interval the oldest xmin going from > > 200069684 to 210363848. E.g.: > > Which database is this on? > > > > [...] > > 266603:2020-03-14 08:49:59.927 UTC LOG: automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_27683":index scans: 0 > > 266604- pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen > > 266605- tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635921 > > 266606- buffer usage: 25 hits, 1 misses, 1 dirtied > > 266607- avg read rate: 15.440 MB/s, avg write rate: 15.440 MB/s > > 266608- system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s > > 266609:2020-03-14 08:49:59.929 UTC LOG: automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_27694":index scans: 0 > > 266610- pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen > > 266611- tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635934 > > 266612- buffer usage: 25 hits, 1 misses, 1 dirtied > > 266613- avg read rate: 21.058 MB/s, avg write rate: 21.058 MB/s > > 266614- system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s > > 266615:2020-03-14 08:49:59.931 UTC LOG: automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_24845":index scans: 0 > > 266616- pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen > > 266617- tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635934 > > 266618- buffer usage: 25 hits, 1 misses, 2 dirtied > > 266619- avg read rate: 8.005 MB/s, avg write rate: 16.009 MB/s > > 266620- system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s > > 266621:2020-03-14 08:49:59.945 UTC LOG: automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_27712":index scans: 0 > > 266622- pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen > > 266623- tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635953 > > [...] > > Do you have any non-toast ones? This is occurring in our environment right now (started about 30 min ago). 
Here 's the latest logs (grepped by vacuum): Mar 23 20:54:16 cowtn postgres[15569]: [12-1] 2020-03-23 20:54:16.542 GMT [15569] LOG: automatic vacuum of table "feedi.production.tita": index scans: 1 Mar 23 20:54:27 cowtn postgres[15654]: [8-1] 2020-03-23 20:54:27.964 GMT [15654] LOG: automatic vacuum of table "feedi.production.distributed_virtual_schedule": index scans: 1 Mar 23 20:54:29 cowtn postgres[15654]: [10-1] 2020-03-23 20:54:29.103 GMT [15654] LOG: automatic vacuum of table "feedi.production.tita": index scans: 1 Mar 23 20:54:29 cowtn postgres[15654]: [12-1] 2020-03-23 20:54:29.284 GMT [15654] LOG: automatic vacuum of table "feedi.pg_toast.pg_toast_2619": index scans: 1 Mar 23 20:54:44 cowtn postgres[15759]: [8-1] 2020-03-23 20:54:44.005 GMT [15759] LOG: automatic vacuum of table "feedi.production.flightplans": index scans: 1 Mar 23 20:54:45 cowtn postgres[15759]: [11-1] 2020-03-23 20:54:45.979 GMT [15759] LOG: automatic vacuum of table "feedi.production.tita": index scans: 1 Mar 23 20:54:54 cowtn postgres[15814]: [7-1] 2020-03-23 20:54:54.367 GMT [15814] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_statistic": index scans: 0 Mar 23 20:54:54 cowtn postgres[15814]: [8-1] 2020-03-23 20:54:54.378 GMT [15814] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_type": index scans: 0 Mar 23 20:54:54 cowtn postgres[15814]: [9-1] 2020-03-23 20:54:54.389 GMT [15814] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_foreign_server": index scans: 0 Mar 23 20:54:54 cowtn postgres[15814]: [10-1] 2020-03-23 20:54:54.410 GMT [15814] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_statistic_ext_data": index scans: 0 Mar 23 20:54:54 cowtn postgres[15814]: [11-1] 2020-03-23 20:54:54.420 GMT [15814] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_user_mapping": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [7-1] 2020-03-23 20:55:09.194 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_attribute": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [8-1] 2020-03-23 20:55:09.205 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_proc": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [9-1] 2020-03-23 20:55:09.216 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_class": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [10-1] 2020-03-23 20:55:09.227 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_attrdef": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [11-1] 2020-03-23 20:55:09.239 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_constraint": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [12-1] 2020-03-23 20:55:09.250 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_inherits": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [13-1] 2020-03-23 20:55:09.260 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_index": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [14-1] 2020-03-23 20:55:09.272 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_operator": index scans: 0 Mar 
23 20:55:09 cowtn postgres[15890]: [15-1] 2020-03-23 20:55:09.283 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_opfamily": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [16-1] 2020-03-23 20:55:09.294 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_opclass": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [17-1] 2020-03-23 20:55:09.305 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_am": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [18-1] 2020-03-23 20:55:09.317 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_amop": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [19-1] 2020-03-23 20:55:09.328 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_amproc": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [20-1] 2020-03-23 20:55:09.339 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_language": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [21-1] 2020-03-23 20:55:09.349 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_largeobject_metadata": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [22-1] 2020-03-23 20:55:09.360 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_aggregate": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [23-1] 2020-03-23 20:55:09.371 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_largeobject": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [24-1] 2020-03-23 20:55:09.382 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_statistic_ext": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [25-1] 2020-03-23 20:55:09.393 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_rewrite": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [26-1] 2020-03-23 20:55:09.404 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_trigger": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [27-1] 2020-03-23 20:55:09.415 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_event_trigger": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [28-1] 2020-03-23 20:55:09.426 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_description": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [29-1] 2020-03-23 20:55:09.437 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_cast": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [30-1] 2020-03-23 20:55:09.448 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_enum": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [31-1] 2020-03-23 20:55:09.459 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_namespace": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [32-1] 2020-03-23 20:55:09.470 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_conversion": index scans: 0 Mar 23 20:55:09 cowtn 
postgres[15890]: [33-1] 2020-03-23 20:55:09.481 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_depend": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [34-1] 2020-03-23 20:55:09.502 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_db_role_setting": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [35-1] 2020-03-23 20:55:09.524 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_pltemplate": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [36-1] 2020-03-23 20:55:09.545 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_shdepend": index scans: 1 Mar 23 20:55:09 cowtn postgres[15890]: [37-1] 2020-03-23 20:55:09.556 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_shdescription": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [38-1] 2020-03-23 20:55:09.567 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_ts_config": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [39-1] 2020-03-23 20:55:09.578 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_ts_config_map": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [40-1] 2020-03-23 20:55:09.589 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_ts_dict": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [41-1] 2020-03-23 20:55:09.600 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_ts_parser": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [42-1] 2020-03-23 20:55:09.611 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_ts_template": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [43-1] 2020-03-23 20:55:09.622 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_extension": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [44-1] 2020-03-23 20:55:09.633 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_foreign_data_wrapper": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [45-1] 2020-03-23 20:55:09.643 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_foreign_table": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [46-1] 2020-03-23 20:55:09.654 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_policy": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [47-1] 2020-03-23 20:55:09.675 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_default_acl": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [48-1] 2020-03-23 20:55:09.686 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_init_privs": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [49-1] 2020-03-23 20:55:09.696 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_seclabel": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [50-1] 2020-03-23 20:55:09.719 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_collation": index scans: 0 Mar 
23 20:55:09 cowtn postgres[15890]: [51-1] 2020-03-23 20:55:09.729 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_partitioned_table": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [52-1] 2020-03-23 20:55:09.740 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_range": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [53-1] 2020-03-23 20:55:09.751 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_transform": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [54-1] 2020-03-23 20:55:09.762 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_sequence": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [55-1] 2020-03-23 20:55:09.772 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_publication": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [56-1] 2020-03-23 20:55:09.783 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_publication_rel": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [57-1] 2020-03-23 20:55:09.794 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_catalog.pg_subscription_rel": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [58-1] 2020-03-23 20:55:09.805 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.information_schema.sql_packages": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [59-1] 2020-03-23 20:55:09.816 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.information_schema.sql_features": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [60-1] 2020-03-23 20:55:09.827 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.information_schema.sql_implementation_info": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [61-1] 2020-03-23 20:55:09.838 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.information_schema.sql_parts": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [62-1] 2020-03-23 20:55:09.849 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.information_schema.sql_languages": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [63-1] 2020-03-23 20:55:09.859 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.information_schema.sql_sizing": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [64-1] 2020-03-23 20:55:09.870 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.information_schema.sql_sizing_profiles": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [65-1] 2020-03-23 20:55:09.880 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_toast.pg_toast_2600": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [66-1] 2020-03-23 20:55:09.891 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_toast.pg_toast_2604": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [67-1] 2020-03-23 20:55:09.902 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_toast.pg_toast_3456": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [68-1] 2020-03-23 20:55:09.912 GMT [15890] LOG: automatic aggressive vacuum to 
prevent wraparound of table "postgres.pg_toast.pg_toast_2606": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [69-1] 2020-03-23 20:55:09.923 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_toast.pg_toast_826": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [70-1] 2020-03-23 20:55:09.933 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_toast.pg_toast_2609": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [71-1] 2020-03-23 20:55:09.944 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_toast.pg_toast_3466": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [72-1] 2020-03-23 20:55:09.955 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_toast.pg_toast_3079": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [73-1] 2020-03-23 20:55:09.965 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_toast.pg_toast_2328": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [74-1] 2020-03-23 20:55:09.976 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_toast.pg_toast_1417": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [75-1] 2020-03-23 20:55:09.987 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_toast.pg_toast_3118": index scans: 0 Mar 23 20:55:09 cowtn postgres[15890]: [76-1] 2020-03-23 20:55:09.997 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_toast.pg_toast_3394": index scans: 0 Mar 23 20:55:10 cowtn postgres[15890]: [77-1] 2020-03-23 20:55:10.008 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_toast.pg_toast_2612": index scans: 0 Mar 23 20:55:10 cowtn postgres[15890]: [78-1] 2020-03-23 20:55:10.018 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_toast.pg_toast_2615": index scans: 0 Mar 23 20:55:10 cowtn postgres[15890]: [79-1] 2020-03-23 20:55:10.029 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_toast.pg_toast_3350": index scans: 0 Mar 23 20:55:10 cowtn postgres[15890]: [80-1] 2020-03-23 20:55:10.040 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_toast.pg_toast_3256": index scans: 0 Mar 23 20:55:10 cowtn postgres[15890]: [81-1] 2020-03-23 20:55:10.050 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_toast.pg_toast_1255": index scans: 0 Mar 23 20:55:10 cowtn postgres[15890]: [82-1] 2020-03-23 20:55:10.062 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_toast.pg_toast_2618": index scans: 0 Mar 23 20:55:10 cowtn postgres[15890]: [83-1] 2020-03-23 20:55:10.072 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_toast.pg_toast_3596": index scans: 0 Mar 23 20:55:10 cowtn postgres[15890]: [84-1] 2020-03-23 20:55:10.083 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_toast.pg_toast_2619": index scans: 0 Mar 23 20:55:10 cowtn postgres[15890]: [85-1] 2020-03-23 20:55:10.094 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_toast.pg_toast_3381": index scans: 0 Mar 23 20:55:10 cowtn postgres[15890]: [86-1] 2020-03-23 20:55:10.105 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table 
"postgres.pg_toast.pg_toast_3429": index scans: 0 Mar 23 20:55:10 cowtn postgres[15890]: [87-1] 2020-03-23 20:55:10.115 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_toast.pg_toast_2620": index scans: 0 Mar 23 20:55:10 cowtn postgres[15890]: [88-1] 2020-03-23 20:55:10.126 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_toast.pg_toast_3600": index scans: 0 Mar 23 20:55:10 cowtn postgres[15890]: [89-1] 2020-03-23 20:55:10.137 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_toast.pg_toast_1247": index scans: 0 Mar 23 20:55:10 cowtn postgres[15890]: [90-1] 2020-03-23 20:55:10.147 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_toast.pg_toast_1418": index scans: 0 Mar 23 20:55:10 cowtn postgres[15890]: [91-1] 2020-03-23 20:55:10.158 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_toast.pg_toast_1260": index scans: 0 Mar 23 20:55:10 cowtn postgres[15890]: [92-1] 2020-03-23 20:55:10.169 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_toast.pg_toast_1262": index scans: 0 Mar 23 20:55:10 cowtn postgres[15890]: [93-1] 2020-03-23 20:55:10.179 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_toast.pg_toast_2964": index scans: 0 Mar 23 20:55:10 cowtn postgres[15890]: [94-1] 2020-03-23 20:55:10.190 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_toast.pg_toast_1136": index scans: 0 Mar 23 20:55:10 cowtn postgres[15890]: [95-1] 2020-03-23 20:55:10.201 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_toast.pg_toast_6000": index scans: 0 Mar 23 20:55:10 cowtn postgres[15890]: [96-1] 2020-03-23 20:55:10.211 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_toast.pg_toast_2396": index scans: 0 Mar 23 20:55:10 cowtn postgres[15890]: [97-1] 2020-03-23 20:55:10.222 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_toast.pg_toast_3592": index scans: 0 Mar 23 20:55:10 cowtn postgres[15890]: [98-1] 2020-03-23 20:55:10.232 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_toast.pg_toast_6100": index scans: 0 Mar 23 20:55:10 cowtn postgres[15890]: [99-1] 2020-03-23 20:55:10.243 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_toast.pg_toast_1213": index scans: 0 Mar 23 20:55:10 cowtn postgres[15890]: [100-1] 2020-03-23 20:55:10.254 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_toast.pg_toast_13267": index scans: 0 Mar 23 20:55:10 cowtn postgres[15890]: [101-1] 2020-03-23 20:55:10.264 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_toast.pg_toast_13272": index scans: 0 Mar 23 20:55:10 cowtn postgres[15890]: [102-1] 2020-03-23 20:55:10.275 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_toast.pg_toast_13287": index scans: 0 Mar 23 20:55:10 cowtn postgres[15890]: [103-1] 2020-03-23 20:55:10.286 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_toast.pg_toast_13277": index scans: 0 Mar 23 20:55:10 cowtn postgres[15890]: [104-1] 2020-03-23 20:55:10.296 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table 
"postgres.pg_toast.pg_toast_13282": index scans: 0 Mar 23 20:55:10 cowtn postgres[15890]: [105-1] 2020-03-23 20:55:10.307 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_toast.pg_toast_13292": index scans: 0 Mar 23 20:55:10 cowtn postgres[15890]: [106-1] 2020-03-23 20:55:10.317 GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of table "postgres.pg_toast.pg_toast_13297": index scans: 0 > > > > > I wonder if what might be happening is that we're somehow missed/failed > > > to update relfrozenxid and/or datfrozenxid. If you manually vacuum some > > > table in the oldest database, but that is *NOT* the oldest table itself, > > > does the problem "resolve" itself? postgres=# SELECT datname , age(datfrozenxid) , current_setting('autovacuum_freeze_max_age') FROM pg_database; datname | age | current_setting -----------+-----------+----------------- postgres | 202375735 | 200000000 template1 | 202345459 | 200000000 template0 | 132459914 | 200000000 feedi | 132459914 | 200000000 (4 rows) > > > > I'll also add pg_class snapshot for next time we run the bench. I'm not sure > > if we'll be able to catch the 1h45 interval when the system stays alive > > after the issue though. > > Could you just script something to stop the benchmark once the disk is > 90% full or so? > > Did you see any errors / fatals around the time autovacuum stopped > working? There are no ERROR or FATAL messages around the time -- (or at all). Since this is occurring right now, what else would be useful to capture? You'd asked about a GDB -- do you want that of the main process or the autovac worker? > > Greetings, > > Andres Freund
Hi, On 2020-03-23 16:31:21 -0500, Justin King wrote: > This is occurring in our environment right now (started about 30 min > ago). Here 's the latest logs (grepped by vacuum): > > Mar 23 20:54:16 cowtn postgres[15569]: [12-1] 2020-03-23 20:54:16.542 > GMT [15569] LOG: automatic vacuum of table "feedi.production.tita": > index scans: 1 > Mar 23 20:54:27 cowtn postgres[15654]: [8-1] 2020-03-23 20:54:27.964 > GMT [15654] LOG: automatic vacuum of table > "feedi.production.distributed_virtual_schedule": index scans: 1 Hm, unfortunately you've cut off the details in the subsequent lines. There's a few newlines in the output. Any chance you could re-post with those included? > > > > I wonder if what might be happening is that we're somehow missed/failed > > > > to update relfrozenxid and/or datfrozenxid. If you manually vacuum some > > > > table in the oldest database, but that is *NOT* the oldest table itself, > > > > does the problem "resolve" itself? > > postgres=# SELECT datname > , age(datfrozenxid) > , current_setting('autovacuum_freeze_max_age') > FROM pg_database; > datname | age | current_setting > -----------+-----------+----------------- > postgres | 202375735 | 200000000 > template1 | 202345459 | 200000000 > template0 | 132459914 | 200000000 > feedi | 132459914 | 200000000 > (4 rows) Can you show the oldest tables in 'feedi'? Or, hm, actually, could you just post the result of all the queries from the "What is:" section in https://postgr.es/m/20200323162303.s7ay5hjdvimtkz6u%40alap3.anarazel.de > Since this is occurring right now, what else would be useful to > capture? You'd asked about a GDB -- do you want that of the main > process or the autovac worker? Unless you can give me gdb access directly, I don't yet have enough data to suggest what exactly we would want to analyze with gdb in your case. It'd be helpful if you could change log_min_messages to DEBUG1 and reload the configuration (not restart!). Greetings, Andres Freund
On Mon, Mar 23, 2020 at 4:31 PM Justin King <kingpin867@gmail.com> wrote: > > On Mon, Mar 23, 2020 at 3:00 PM Andres Freund <andres@anarazel.de> wrote: > > > > Hi, > > > > On 2020-03-23 20:47:25 +0100, Julien Rouhaud wrote: > > > > - relfrozenxid, age(relfrozenxid) for the oldest table in the oldest database > > > > SELECT oid::regclass, age(relfrozenxid), relfrozenxid FROM pg_class WHERE relfrozenxid <> 0 ORDER BY age(relfrozenxid)DESC LIMIT 1; > > > > > > The vm has been trashed since, and I don't have that level of detail available > > > in the gathered stats unfortunately (and the available information I have is a > > > little bit messy, sorry for that). > > > > > > > - Oldest backend xmin > > > > SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_activity WHERE backend_xmin <> 0 ORDER BY age(backend_xmin)DESC LIMIT 3; > > > > - oldest replication xmin: > > > > SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_replication WHERE backend_xmin <> 0 ORDER BY age(backend_xmin)DESC LIMIT 3; > > > > - oldest slot xmin: > > > > SELECT slot_name, xmin, age(xmin), catalog_xmin, age(catalog_xmin) FROM pg_replication_slots WHERE xmin <> 0 ORcatalog_xmin <> 0 ORDER BY greatest(age(xmin), age(catalog_xmin)) DESC LIMIT 3; > > > > - oldest prepared transaction (unfortunately xmin not available) > > > > SELECT gid, database, transaction FROM pg_prepared_xacts ORDER BY age(transaction) LIMIT 3; > > > > > > I have output from this query: https://github.com/OPMDG/check_pgactivity/blob/oldest_xmin/check_pgactivity#L5669-L5732 > > > (which may be dumb). Here are the non-null rows after 8AM GMT+1: > > > > Could you share what the config of the server was? > > > > > > > > > The same bench was run against pg11 many times and never triggered this issue. > > > > > So far our best guess is a side effect of 2aa6e331ead7. > > > > > > > > In that case you'd likely see DEBUG1 output, right? Did you try running > > > > with that? > > > > > > That's unfortunately not an option, as while the issue is reproducible, it > > > happens after roughly 16h with an average of 6.5ktps, and we definitely don't > > > have enough disk space for that amount of logs. We'll try next time with a > > > recompiled version with the DEBUG1 promoted to WARNING, and maybe other elog. > > > > You wouldn't need to run it with DEBUG1 the whole time, you could just > > change the config and reload config once you hit the issue. > > > > > > > > > Michael and I have been trying to reproduce this issue locally (drastically > > > > > reducing the various freeze_age parameters) for hours, but no luck for now. > > > > > > > > Have you considered using gdb to investigate? > > > > > > Not for now, because it's hard to be around for the ~ 90min interval the system > > > stays alive after atuvacuum get stuck, especially since it's a weekend bench. > > > > > > > > This is using a vanilla pg 12.1, with some OLTP workload. The only possibly > > > > > relevant configuration changes are quite aggressive autovacuum settings on some > > > > > tables (no delay, analyze/vacuum threshold to 1k and analyze/vacuum scale > > > > > factor to 0, for both heap and toast). > > > > > > > > That, uh, is not an insignificant set of changes for an autovac > > > > scheduling issues. > > > > > > > > It sounds like what might be happening is that you have something > > > > holding back the "oldest needed transaction" horizon. 
Before > > > > 2aa6e331ead7, if the xmin horizon hasn't increased, there'd be repeated > > > > vacuums unable to increase the horizon, but afterwards they'll all > > > > immediately exit without any messages. > > > > > > The xmin horizon is increasing AFAICS. Grepping the autovacuum logs, I can see > > > for the 08:00 AM UTC - 08:59 AM UTC interval the oldest xmin going from > > > 200069684 to 210363848. E.g.: > > > > Which database is this on? > > > > > > > [...] > > > 266603:2020-03-14 08:49:59.927 UTC LOG: automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_27683":index scans: 0 > > > 266604- pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen > > > 266605- tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635921 > > > 266606- buffer usage: 25 hits, 1 misses, 1 dirtied > > > 266607- avg read rate: 15.440 MB/s, avg write rate: 15.440 MB/s > > > 266608- system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s > > > 266609:2020-03-14 08:49:59.929 UTC LOG: automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_27694":index scans: 0 > > > 266610- pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen > > > 266611- tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635934 > > > 266612- buffer usage: 25 hits, 1 misses, 1 dirtied > > > 266613- avg read rate: 21.058 MB/s, avg write rate: 21.058 MB/s > > > 266614- system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s > > > 266615:2020-03-14 08:49:59.931 UTC LOG: automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_24845":index scans: 0 > > > 266616- pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen > > > 266617- tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635934 > > > 266618- buffer usage: 25 hits, 1 misses, 2 dirtied > > > 266619- avg read rate: 8.005 MB/s, avg write rate: 16.009 MB/s > > > 266620- system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s > > > 266621:2020-03-14 08:49:59.945 UTC LOG: automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_27712":index scans: 0 > > > 266622- pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen > > > 266623- tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635953 > > > [...] > > > > Do you have any non-toast ones? > > This is occurring in our environment right now (started about 30 min > ago). 
Here 's the latest logs (grepped by vacuum): > > Mar 23 20:54:16 cowtn postgres[15569]: [12-1] 2020-03-23 20:54:16.542 > GMT [15569] LOG: automatic vacuum of table "feedi.production.tita": > index scans: 1 > Mar 23 20:54:27 cowtn postgres[15654]: [8-1] 2020-03-23 20:54:27.964 > GMT [15654] LOG: automatic vacuum of table > "feedi.production.distributed_virtual_schedule": index scans: 1 > Mar 23 20:54:29 cowtn postgres[15654]: [10-1] 2020-03-23 20:54:29.103 > GMT [15654] LOG: automatic vacuum of table "feedi.production.tita": > index scans: 1 > Mar 23 20:54:29 cowtn postgres[15654]: [12-1] 2020-03-23 20:54:29.284 > GMT [15654] LOG: automatic vacuum of table > "feedi.pg_toast.pg_toast_2619": index scans: 1 > Mar 23 20:54:44 cowtn postgres[15759]: [8-1] 2020-03-23 20:54:44.005 > GMT [15759] LOG: automatic vacuum of table > "feedi.production.flightplans": index scans: 1 > Mar 23 20:54:45 cowtn postgres[15759]: [11-1] 2020-03-23 20:54:45.979 > GMT [15759] LOG: automatic vacuum of table "feedi.production.tita": > index scans: 1 > Mar 23 20:54:54 cowtn postgres[15814]: [7-1] 2020-03-23 20:54:54.367 > GMT [15814] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_statistic": index scans: 0 > Mar 23 20:54:54 cowtn postgres[15814]: [8-1] 2020-03-23 20:54:54.378 > GMT [15814] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_type": index scans: 0 > Mar 23 20:54:54 cowtn postgres[15814]: [9-1] 2020-03-23 20:54:54.389 > GMT [15814] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_foreign_server": index scans: 0 > Mar 23 20:54:54 cowtn postgres[15814]: [10-1] 2020-03-23 20:54:54.410 > GMT [15814] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_statistic_ext_data": index scans: 0 > Mar 23 20:54:54 cowtn postgres[15814]: [11-1] 2020-03-23 20:54:54.420 > GMT [15814] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_user_mapping": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [7-1] 2020-03-23 20:55:09.194 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_attribute": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [8-1] 2020-03-23 20:55:09.205 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_proc": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [9-1] 2020-03-23 20:55:09.216 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_class": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [10-1] 2020-03-23 20:55:09.227 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_attrdef": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [11-1] 2020-03-23 20:55:09.239 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_constraint": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [12-1] 2020-03-23 20:55:09.250 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_inherits": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [13-1] 2020-03-23 20:55:09.260 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_index": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [14-1] 2020-03-23 20:55:09.272 > GMT [15890] LOG: 
automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_operator": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [15-1] 2020-03-23 20:55:09.283 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_opfamily": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [16-1] 2020-03-23 20:55:09.294 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_opclass": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [17-1] 2020-03-23 20:55:09.305 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_am": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [18-1] 2020-03-23 20:55:09.317 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_amop": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [19-1] 2020-03-23 20:55:09.328 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_amproc": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [20-1] 2020-03-23 20:55:09.339 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_language": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [21-1] 2020-03-23 20:55:09.349 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_largeobject_metadata": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [22-1] 2020-03-23 20:55:09.360 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_aggregate": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [23-1] 2020-03-23 20:55:09.371 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_largeobject": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [24-1] 2020-03-23 20:55:09.382 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_statistic_ext": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [25-1] 2020-03-23 20:55:09.393 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_rewrite": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [26-1] 2020-03-23 20:55:09.404 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_trigger": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [27-1] 2020-03-23 20:55:09.415 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_event_trigger": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [28-1] 2020-03-23 20:55:09.426 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_description": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [29-1] 2020-03-23 20:55:09.437 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_cast": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [30-1] 2020-03-23 20:55:09.448 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_enum": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [31-1] 2020-03-23 20:55:09.459 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_namespace": index scans: 0 > 
Mar 23 20:55:09 cowtn postgres[15890]: [32-1] 2020-03-23 20:55:09.470 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_conversion": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [33-1] 2020-03-23 20:55:09.481 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_depend": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [34-1] 2020-03-23 20:55:09.502 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_db_role_setting": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [35-1] 2020-03-23 20:55:09.524 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_pltemplate": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [36-1] 2020-03-23 20:55:09.545 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_shdepend": index scans: 1 > Mar 23 20:55:09 cowtn postgres[15890]: [37-1] 2020-03-23 20:55:09.556 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_shdescription": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [38-1] 2020-03-23 20:55:09.567 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_ts_config": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [39-1] 2020-03-23 20:55:09.578 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_ts_config_map": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [40-1] 2020-03-23 20:55:09.589 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_ts_dict": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [41-1] 2020-03-23 20:55:09.600 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_ts_parser": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [42-1] 2020-03-23 20:55:09.611 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_ts_template": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [43-1] 2020-03-23 20:55:09.622 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_extension": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [44-1] 2020-03-23 20:55:09.633 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_foreign_data_wrapper": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [45-1] 2020-03-23 20:55:09.643 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_foreign_table": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [46-1] 2020-03-23 20:55:09.654 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_policy": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [47-1] 2020-03-23 20:55:09.675 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_default_acl": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [48-1] 2020-03-23 20:55:09.686 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_init_privs": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [49-1] 2020-03-23 20:55:09.696 > GMT 
[15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_seclabel": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [50-1] 2020-03-23 20:55:09.719 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_collation": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [51-1] 2020-03-23 20:55:09.729 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_partitioned_table": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [52-1] 2020-03-23 20:55:09.740 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_range": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [53-1] 2020-03-23 20:55:09.751 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_transform": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [54-1] 2020-03-23 20:55:09.762 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_sequence": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [55-1] 2020-03-23 20:55:09.772 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_publication": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [56-1] 2020-03-23 20:55:09.783 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_publication_rel": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [57-1] 2020-03-23 20:55:09.794 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_catalog.pg_subscription_rel": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [58-1] 2020-03-23 20:55:09.805 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.information_schema.sql_packages": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [59-1] 2020-03-23 20:55:09.816 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.information_schema.sql_features": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [60-1] 2020-03-23 20:55:09.827 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.information_schema.sql_implementation_info": index > scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [61-1] 2020-03-23 20:55:09.838 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.information_schema.sql_parts": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [62-1] 2020-03-23 20:55:09.849 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.information_schema.sql_languages": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [63-1] 2020-03-23 20:55:09.859 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.information_schema.sql_sizing": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [64-1] 2020-03-23 20:55:09.870 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.information_schema.sql_sizing_profiles": index scans: > 0 > Mar 23 20:55:09 cowtn postgres[15890]: [65-1] 2020-03-23 20:55:09.880 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_toast.pg_toast_2600": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [66-1] 2020-03-23 20:55:09.891 > GMT [15890] LOG: 
automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_toast.pg_toast_2604": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [67-1] 2020-03-23 20:55:09.902 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_toast.pg_toast_3456": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [68-1] 2020-03-23 20:55:09.912 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_toast.pg_toast_2606": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [69-1] 2020-03-23 20:55:09.923 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_toast.pg_toast_826": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [70-1] 2020-03-23 20:55:09.933 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_toast.pg_toast_2609": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [71-1] 2020-03-23 20:55:09.944 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_toast.pg_toast_3466": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [72-1] 2020-03-23 20:55:09.955 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_toast.pg_toast_3079": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [73-1] 2020-03-23 20:55:09.965 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_toast.pg_toast_2328": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [74-1] 2020-03-23 20:55:09.976 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_toast.pg_toast_1417": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [75-1] 2020-03-23 20:55:09.987 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_toast.pg_toast_3118": index scans: 0 > Mar 23 20:55:09 cowtn postgres[15890]: [76-1] 2020-03-23 20:55:09.997 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_toast.pg_toast_3394": index scans: 0 > Mar 23 20:55:10 cowtn postgres[15890]: [77-1] 2020-03-23 20:55:10.008 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_toast.pg_toast_2612": index scans: 0 > Mar 23 20:55:10 cowtn postgres[15890]: [78-1] 2020-03-23 20:55:10.018 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_toast.pg_toast_2615": index scans: 0 > Mar 23 20:55:10 cowtn postgres[15890]: [79-1] 2020-03-23 20:55:10.029 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_toast.pg_toast_3350": index scans: 0 > Mar 23 20:55:10 cowtn postgres[15890]: [80-1] 2020-03-23 20:55:10.040 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_toast.pg_toast_3256": index scans: 0 > Mar 23 20:55:10 cowtn postgres[15890]: [81-1] 2020-03-23 20:55:10.050 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_toast.pg_toast_1255": index scans: 0 > Mar 23 20:55:10 cowtn postgres[15890]: [82-1] 2020-03-23 20:55:10.062 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_toast.pg_toast_2618": index scans: 0 > Mar 23 20:55:10 cowtn postgres[15890]: [83-1] 2020-03-23 20:55:10.072 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_toast.pg_toast_3596": index scans: 0 > Mar 23 
20:55:10 cowtn postgres[15890]: [84-1] 2020-03-23 20:55:10.083 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_toast.pg_toast_2619": index scans: 0 > Mar 23 20:55:10 cowtn postgres[15890]: [85-1] 2020-03-23 20:55:10.094 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_toast.pg_toast_3381": index scans: 0 > Mar 23 20:55:10 cowtn postgres[15890]: [86-1] 2020-03-23 20:55:10.105 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_toast.pg_toast_3429": index scans: 0 > Mar 23 20:55:10 cowtn postgres[15890]: [87-1] 2020-03-23 20:55:10.115 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_toast.pg_toast_2620": index scans: 0 > Mar 23 20:55:10 cowtn postgres[15890]: [88-1] 2020-03-23 20:55:10.126 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_toast.pg_toast_3600": index scans: 0 > Mar 23 20:55:10 cowtn postgres[15890]: [89-1] 2020-03-23 20:55:10.137 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_toast.pg_toast_1247": index scans: 0 > Mar 23 20:55:10 cowtn postgres[15890]: [90-1] 2020-03-23 20:55:10.147 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_toast.pg_toast_1418": index scans: 0 > Mar 23 20:55:10 cowtn postgres[15890]: [91-1] 2020-03-23 20:55:10.158 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_toast.pg_toast_1260": index scans: 0 > Mar 23 20:55:10 cowtn postgres[15890]: [92-1] 2020-03-23 20:55:10.169 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_toast.pg_toast_1262": index scans: 0 > Mar 23 20:55:10 cowtn postgres[15890]: [93-1] 2020-03-23 20:55:10.179 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_toast.pg_toast_2964": index scans: 0 > Mar 23 20:55:10 cowtn postgres[15890]: [94-1] 2020-03-23 20:55:10.190 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_toast.pg_toast_1136": index scans: 0 > Mar 23 20:55:10 cowtn postgres[15890]: [95-1] 2020-03-23 20:55:10.201 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_toast.pg_toast_6000": index scans: 0 > Mar 23 20:55:10 cowtn postgres[15890]: [96-1] 2020-03-23 20:55:10.211 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_toast.pg_toast_2396": index scans: 0 > Mar 23 20:55:10 cowtn postgres[15890]: [97-1] 2020-03-23 20:55:10.222 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_toast.pg_toast_3592": index scans: 0 > Mar 23 20:55:10 cowtn postgres[15890]: [98-1] 2020-03-23 20:55:10.232 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_toast.pg_toast_6100": index scans: 0 > Mar 23 20:55:10 cowtn postgres[15890]: [99-1] 2020-03-23 20:55:10.243 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_toast.pg_toast_1213": index scans: 0 > Mar 23 20:55:10 cowtn postgres[15890]: [100-1] 2020-03-23 20:55:10.254 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_toast.pg_toast_13267": index scans: 0 > Mar 23 20:55:10 cowtn postgres[15890]: [101-1] 2020-03-23 20:55:10.264 > GMT [15890] LOG: automatic aggressive vacuum to 
prevent wraparound of > table "postgres.pg_toast.pg_toast_13272": index scans: 0 > Mar 23 20:55:10 cowtn postgres[15890]: [102-1] 2020-03-23 20:55:10.275 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_toast.pg_toast_13287": index scans: 0 > Mar 23 20:55:10 cowtn postgres[15890]: [103-1] 2020-03-23 20:55:10.286 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_toast.pg_toast_13277": index scans: 0 > Mar 23 20:55:10 cowtn postgres[15890]: [104-1] 2020-03-23 20:55:10.296 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_toast.pg_toast_13282": index scans: 0 > Mar 23 20:55:10 cowtn postgres[15890]: [105-1] 2020-03-23 20:55:10.307 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_toast.pg_toast_13292": index scans: 0 > Mar 23 20:55:10 cowtn postgres[15890]: [106-1] 2020-03-23 20:55:10.317 > GMT [15890] LOG: automatic aggressive vacuum to prevent wraparound of > table "postgres.pg_toast.pg_toast_13297": index scans: 0 > > > > > > > > > I wonder if what might be happening is that we're somehow missed/failed > > > > to update relfrozenxid and/or datfrozenxid. If you manually vacuum some > > > > table in the oldest database, but that is *NOT* the oldest table itself, > > > > does the problem "resolve" itself? Andres- I did confirm that if you run a manual vacuum on a table in the oldest database but NOT the oldest table, the problem does resolve itself. Here is the before and after: postgres=# SELECT datname , age(datfrozenxid) , current_setting('autovacuum_freeze_max_age') FROM pg_database; datname | age | current_setting -----------+-----------+----------------- postgres | 202375735 | 200000000 template1 | 202345459 | 200000000 template0 | 132459914 | 200000000 feedi | 132459914 | 200000000 (4 rows) postgres=# SELECT c.oid::regclass , age(c.relfrozenxid) , pg_size_pretty(pg_total_relation_size(c.oid)) , age(c.relfrozenxid)/200000000::float AS pct FROM pg_class c JOIN pg_namespace n on c.relnamespace = n.oid WHERE relkind IN ('r', 't', 'm') ORDER BY 2 DESC LIMIT 10; oid | age | pg_size_pretty | pct -----------------------+-----------+----------------+------------- pg_subscription | 203640681 | 24 kB | 1.018203405 pg_authid | 203640681 | 80 kB | 1.018203405 pg_tablespace | 203620449 | 80 kB | 1.018102245 pg_shseclabel | 203620449 | 16 kB | 1.018102245 pg_auth_members | 203620449 | 72 kB | 1.018102245 pg_database | 203620449 | 80 kB | 1.018102245 pg_replication_origin | 203620449 | 24 kB | 1.018102245 pg_statistic | 53646423 | 456 kB | 0.268232115 pg_statistic_ext_data | 53646423 | 16 kB | 0.268232115 pg_user_mapping | 53646423 | 24 kB | 0.268232115 (10 rows) postgres=# vacuum FREEZE VERBOSE ANALYZE pg_statistic; INFO: aggressively vacuuming "pg_catalog.pg_statistic" INFO: "pg_statistic": found 0 removable, 18 nonremovable row versions in 1 out of 38 pages DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 39602876 There were 18 unused item identifiers. Skipped 0 pages due to buffer pins, 37 frozen pages. 0 pages are entirely empty. CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s. INFO: aggressively vacuuming "pg_toast.pg_toast_2619" INFO: "pg_toast_2619": found 0 removable, 13 nonremovable row versions in 4 out of 4 pages DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 39602876 There were 12 unused item identifiers. Skipped 0 pages due to buffer pins, 0 frozen pages. 0 pages are entirely empty. 
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s. VACUUM postgres=# SELECT datname , age(datfrozenxid) , current_setting('autovacuum_freeze_max_age') FROM pg_database; datname | age | current_setting -----------+-----------+----------------- postgres | 54640707 | 200000000 template1 | 38248 | 200000000 template0 | 134719144 | 200000000 feedi | 134719144 | 200000000 (4 rows) Immediately after running this, all the autovac's starting working again. I've attached logfiles that might be easier to read. We are going to try and recreate the issue, in the meantime, I've turned on DEBUG1 and when it reoccurs, I'll run the queries in the "What Is" section as well. I also have a backtrace of the autovac launcher, but there appears to be nothing useful at all there (as you'd probably expect) and there were no autovac workers running. Thank you again- Justin > > postgres=# SELECT datname > , age(datfrozenxid) > , current_setting('autovacuum_freeze_max_age') > FROM pg_database; > datname | age | current_setting > -----------+-----------+----------------- > postgres | 202375735 | 200000000 > template1 | 202345459 | 200000000 > template0 | 132459914 | 200000000 > feedi | 132459914 | 200000000 > (4 rows) > > > > > > > > I'll also add pg_class snapshot for next time we run the bench. I'm not sure > > > if we'll be able to catch the 1h45 interval when the system stays alive > > > after the issue though. > > > > Could you just script something to stop the benchmark once the disk is > > 90% full or so? > > > > Did you see any errors / fatals around the time autovacuum stopped > > working? > > There are no ERROR or FATAL messages around the time -- (or at all). > > Since this is occurring right now, what else would be useful to > capture? You'd asked about a GDB -- do you want that of the main > process or the autovac worker? > > > > > Greetings, > > > > Andres Freund
Attachment
On Mon, Mar 23, 2020 at 01:00:51PM -0700, Andres Freund wrote: > On 2020-03-23 20:47:25 +0100, Julien Rouhaud wrote: >>> - relfrozenxid, age(relfrozenxid) for the oldest table in the oldest database >>> SELECT oid::regclass, age(relfrozenxid), relfrozenxid FROM pg_class WHERE relfrozenxid <> 0 ORDER BY age(relfrozenxid)DESC LIMIT 1; >> >> The vm has been trashed since, and I don't have that level of detail available >> in the gathered stats unfortunately (and the available information I have is a >> little bit messy, sorry for that). >> >>> - Oldest backend xmin >>> SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_activity WHERE backend_xmin <> 0 ORDER BY age(backend_xmin)DESC LIMIT 3; >>> - oldest replication xmin: >>> SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_replication WHERE backend_xmin <> 0 ORDER BY age(backend_xmin)DESC LIMIT 3; >>> - oldest slot xmin: >>> SELECT slot_name, xmin, age(xmin), catalog_xmin, age(catalog_xmin) FROM pg_replication_slots WHERE xmin <> 0 OR catalog_xmin<> 0 ORDER BY greatest(age(xmin), age(catalog_xmin)) DESC LIMIT 3; >>> - oldest prepared transaction (unfortunately xmin not available) >>> SELECT gid, database, transaction FROM pg_prepared_xacts ORDER BY age(transaction) LIMIT 3; >> >> I have output from this query: https://github.com/OPMDG/check_pgactivity/blob/oldest_xmin/check_pgactivity#L5669-L5732 >> (which may be dumb). Here are the non-null rows after 8AM GMT+1: This stuff does not use 2PC and logical slots (there is one physical slot for a WAL archiver), but we got a dump of pg_stat_activity. Grepping more info about the evolution of pg_database and pg_class is on our list. > Could you share what the config of the server was? Nothing really fancy: - autovacuum_vacuum_cost_delay to 2ms (default of v12, but we used it in v11 as well). - autovacuum_naptime = 15s - autovacuum_max_workers = 6 - log_autovacuum_min_duration = 0 >>> In that case you'd likely see DEBUG1 output, right? Did you try running >>> with that? >> >> That's unfortunately not an option, as while the issue is reproducible, it >> happens after roughly 16h with an average of 6.5ktps, and we definitely don't >> have enough disk space for that amount of logs. We'll try next time with a >> recompiled version with the DEBUG1 promoted to WARNING, and maybe other elog. > > You wouldn't need to run it with DEBUG1 the whole time, you could just > change the config and reload config once you hit the issue. Sure, though it would reduce the window when the cluster is still up and running. One option that we have is just to increase the elevel from DEBUG1 to WARNING for the log added in 2aa6e33 to keep the amount of logs reduced without losing information, and we have no issue to do runs with custom patches. >> Not for now, because it's hard to be around for the ~ 90min interval the system >> stays alive after atuvacuum get stuck, especially since it's a weekend bench. Yeah, our room is very limited here. I would likely not ne able to do that, or I would just get very lucky with my timezone. Not sure for Julien. >>> It sounds like what might be happening is that you have something >>> holding back the "oldest needed transaction" horizon. Before >>> 2aa6e331ead7, if the xmin horizon hasn't increased, there'd be repeated >>> vacuums unable to increase the horizon, but afterwards they'll all >>> immediately exit without any messages. >> >> The xmin horizon is increasing AFAICS. 
Grepping the autovacuum logs, I can see >> for the 08:00 AM UTC - 08:59 AM UTC interval the oldest xmin going from >> 200069684 to 210363848. E.g.: > > Which database is this on? In this case this was the database used by the application, FWIW. But I can see from those logs that it kept increasing for all the other databases, like postgres or template1 when this set of aggressive jobs happened. >> 266621:2020-03-14 08:49:59.945 UTC LOG: automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_27712":index scans: 0 >> 266622- pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen >> 266623- tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635953 >> [...] > > Do you have any non-toast ones? Yep, this includes catalogs and normal tables, based on the logs all the relations triggered aggressive and anti-wraparound jobs. >>> I wonder if what might be happening is that we're somehow missed/failed >>> to update relfrozenxid and/or datfrozenxid. If you manually vacuum some >>> table in the oldest database, but that is *NOT* the oldest table itself, >>> does the problem "resolve" itself? >> >> I'll also add pg_class snapshot for next time we run the bench. I'm not sure >> if we'll be able to catch the 1h45 interval when the system stays alive >> after the issue though. > > Could you just script something to stop the benchmark once the disk is > 90% full or so? Hmm. I think that this one is possible still tricky. There are some alarms in place in this system. > Did you see any errors / fatals around the time autovacuum stopped > working? Before going rogue (we are not sure if autovacuum didn't launch any workers or if the workers were spawned and exited early as we did not capture any worker information in pg_stat_activity), we saw a bunch of aggressive wraparound jobs. Even after that, we have traces in the logs of one autovacuum analyze happening at equal interval of time (17 minutes) on one single table, which is... Er... uncommon to say the least. -- Michael
Attachment
Hi, On 2020-03-24 14:26:06 +0900, Michael Paquier wrote: > > Could you share what the config of the server was? > > Nothing really fancy: > - autovacuum_vacuum_cost_delay to 2ms (default of v12, but we used it > in v11 as well). > - autovacuum_naptime = 15s > - autovacuum_max_workers = 6 > - log_autovacuum_min_duration = 0 Oh, so you're also involved in this? I'm starting to get a bit confused as to who is reporting what. > > Did you see any errors / fatals around the time autovacuum stopped > > working? > > Before going rogue (we are not sure if autovacuum didn't launch any > workers or if the workers were spawned and exited early as we did not > capture any worker information in pg_stat_activity), we saw a bunch of > aggressive wraparound jobs. Even after that, we have traces in the > logs of one autovacuum analyze happening at equal interval of time (17 > minutes) on one single table, which is... Er... uncommon to say the > least. Well, there's no logging of autovacuum launchers that don't do anything due to the "skipping redundant" logic, with normal log level. If somehow the horizon logic of autovacuum workers gets out of whack with what vacuumlazy.c does, then you'd not get any vacuums. But a usage level triggered analyze could still happen on such a table, I think. While looking at this issue I found a few problems, btw. That seems more like a -hackers discussion, so I started: https://postgr.es/m/20200323235036.6pje6usrjjx22zv3%40alap3.anarazel.de I think I might just have figured out another one... Greetings, Andres Freund
On Mon, Mar 23, 2020 at 10:40:39PM -0700, Andres Freund wrote: > On 2020-03-24 14:26:06 +0900, Michael Paquier wrote: >> Nothing really fancy: >> - autovacuum_vacuum_cost_delay to 2ms (default of v12, but we used it >> in v11 as well). >> - autovacuum_naptime = 15s >> - autovacuum_max_workers = 6 >> - log_autovacuum_min_duration = 0 > > Oh, so you're also involved in this? I'm starting to get a bit confused > as to who is reporting what. Yeah, sort of. Julien has done a lot of work on that and I have an access to the data and test beds, so we are just saying the same things. > Well, there's no logging of autovacuum launchers that don't do anything > due to the "skipping redundant" logic, with normal log level. If somehow > the horizon logic of autovacuum workers gets out of whack with what > vacuumlazy.c does, then you'd not get any vacuums. But a usage level > triggered analyze could still happen on such a table, I think. What surprised me the most is that the same table happened to be analyzed again and again after the launcher began its blackout. > While looking at this issue I found a few problems, btw. That seems more > like a -hackers discussion, so I started: > https://postgr.es/m/20200323235036.6pje6usrjjx22zv3%40alap3.anarazel.de Yes, let's discuss there. > I think I might just have figured out another one... Ouch. -- Michael
Attachment
Hi, On 2020-03-24 15:12:38 +0900, Michael Paquier wrote: > > Well, there's no logging of autovacuum launchers that don't do anything > > due to the "skipping redundant" logic, with normal log level. If somehow > > the horizon logic of autovacuum workers gets out of whack with what > > vacuumlazy.c does, then you'd not get any vacuums. But a usage level > > triggered analyze could still happen on such a table, I think. > > What surprised me the most is that the same table happened to be > analyzed again and again after the launcher began its blackout. Well, if there's an issue with the "redundant" logic, that would be a not too surprising outcome. It's quite plausible that one or two tables in the database would get enough changes to occasionally need to be analyzed. If the workload is steady, that could e.g. work out to every ~17 minutes. All tables that autovacuum things are not wraparound threatened will be skipped, but ones that are will get both vacuum and analyze queued. The redundant logic could then entirely skip all vacuums - but there's no equivalent for analyze. > > While looking at this issue I found a few problems, btw. That seems more > > like a -hackers discussion, so I started: > > https://postgr.es/m/20200323235036.6pje6usrjjx22zv3%40alap3.anarazel.de > > Yes, let's discuss there. Cool. Would also be good if you could expand on the thread introducing the "redundant" logic. Greetings, Andres Freund
All-

This started happening again. DEBUG1 is enabled:

Mar 25 14:48:03 cowtn postgres[39720]: [35294-1] 2020-03-25 14:48:03.972 GMT [39720] DEBUG: autovacuum: processing database "template0"
Mar 25 14:48:06 cowtn postgres[39735]: [35294-1] 2020-03-25 14:48:06.545 GMT [39735] DEBUG: autovacuum: processing database "postgres"
Mar 25 14:48:11 cowtn postgres[39759]: [35294-1] 2020-03-25 14:48:11.284 GMT [39759] DEBUG: autovacuum: processing database "template1"
Mar 25 14:48:14 cowtn postgres[39772]: [35294-1] 2020-03-25 14:48:14.564 GMT [39772] DEBUG: autovacuum: processing database "feedi"
Mar 25 14:48:14 cowtn postgres[39772]: [35295-1] 2020-03-25 14:48:14.588 GMT [39772] LOG: automatic vacuum of table "feedi.pg_catalog.pg_statistic": index scans: 1
Mar 25 14:48:14 cowtn postgres[39772]: [35295-2] #011pages: 0 removed, 117 remain, 0 skipped due to pins, 0 skipped frozen
Mar 25 14:48:14 cowtn postgres[39772]: [35295-3] #011tuples: 477 removed, 704 remain, 0 are dead but not yet removable, oldest xmin: 189591147
Mar 25 14:48:14 cowtn postgres[39772]: [35295-4] #011buffer usage: 315 hits, 0 misses, 0 dirtied
Mar 25 14:48:14 cowtn postgres[39772]: [35295-5] #011avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
Mar 25 14:48:14 cowtn postgres[39772]: [35295-6] #011system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
Mar 25 14:48:15 cowtn postgres[39772]: [35296-1] 2020-03-25 14:48:15.606 GMT [39772] LOG: automatic vacuum of table "feedi.production.flightplans": index scans: 1
Mar 25 14:48:15 cowtn postgres[39772]: [35296-2] #011pages: 0 removed, 61331 remain, 0 skipped due to pins, 29772 skipped frozen
Mar 25 14:48:15 cowtn postgres[39772]: [35296-3] #011tuples: 21807 removed, 509983 remain, 1581 are dead but not yet removable, oldest xmin: 189591147
Mar 25 14:48:15 cowtn postgres[39772]: [35296-4] #011buffer usage: 113684 hits, 0 misses, 1 dirtied
Mar 25 14:48:15 cowtn postgres[39772]: [35296-5] #011avg read rate: 0.000 MB/s, avg write rate: 0.008 MB/s
Mar 25 14:48:15 cowtn postgres[39772]: [35296-6] #011system usage: CPU: user: 0.64 s, system: 0.12 s, elapsed: 1.00 s
Mar 25 14:48:16 cowtn postgres[39772]: [35297-1] 2020-03-25 14:48:16.537 GMT [39772] LOG: automatic analyze of table "feedi.production.flightplans" system usage: CPU: user: 0.83 s, system: 0.03 s, elapsed: 0.93 s
Mar 25 14:48:16 cowtn postgres[39772]: [35298-1] 2020-03-25 14:48:16.627 GMT [39772] LOG: automatic vacuum of table "feedi.production.tita": index scans: 1
Mar 25 14:48:16 cowtn postgres[39772]: [35298-2] #011pages: 0 removed, 1711 remain, 0 skipped due to pins, 0 skipped frozen
Mar 25 14:48:16 cowtn postgres[39772]: [35298-3] #011tuples: 5936 removed, 75280 remain, 1079 are dead but not yet removable, oldest xmin: 189591147
Mar 25 14:48:16 cowtn postgres[39772]: [35298-4] #011buffer usage: 5748 hits, 0 misses, 1 dirtied
Mar 25 14:48:16 cowtn postgres[39772]: [35298-5] #011avg read rate: 0.000 MB/s, avg write rate: 0.100 MB/s
Mar 25 14:48:16 cowtn postgres[39772]: [35298-6] #011system usage: CPU: user: 0.06 s, system: 0.00 s, elapsed: 0.07 s
Mar 25 14:48:16 cowtn postgres[39772]: [35299-1] 2020-03-25 14:48:16.959 GMT [39772] LOG: automatic analyze of table "feedi.production.tita" system usage: CPU: user: 0.32 s, system: 0.00 s, elapsed: 0.33 s
Mar 25 14:48:16 cowtn postgres[39772]: [35300-1] 2020-03-25 14:48:16.969 GMT [39772] LOG: automatic analyze of table "feedi.production.virtual_clocks" system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
Mar 25 14:48:18 cowtn postgres[39790]: [35294-1] 2020-03-25 14:48:18.975 GMT [39790] DEBUG: autovacuum: processing database "template0"
Mar 25 14:48:21 cowtn postgres[39799]: [35294-1] 2020-03-25 14:48:21.546 GMT [39799] DEBUG: autovacuum: processing database "postgres"
Mar 25 14:48:22 cowtn postgres[39853]: [35294-1] 2020-03-25 14:48:22.446 GMT [39853] LOG: connection received: host=10.4.4.11 port=25424
Mar 25 14:48:22 cowtn postgres[39853]: [35295-1] 2020-03-25 14:48:22.451 GMT [39853] LOG: connection authorized: user=feedi database=feedi SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256, compression=off)
Mar 25 14:48:26 cowtn postgres[39875]: [35294-1] 2020-03-25 14:48:26.277 GMT [39875] DEBUG: autovacuum: processing database "postgres"
Mar 25 14:48:26 cowtn postgres[39875]: [35295-1] 2020-03-25 14:48:26.298 GMT [39875] DEBUG: skipping redundant vacuum to prevent wraparound of table "postgres.pg_catalog.pg_authid"
Mar 25 14:48:26 cowtn postgres[39875]: [35296-1] 2020-03-25 14:48:26.309 GMT [39875] DEBUG: skipping redundant vacuum to prevent wraparound of table "postgres.pg_catalog.pg_subscription"
Mar 25 14:48:26 cowtn postgres[39875]: [35297-1] 2020-03-25 14:48:26.319 GMT [39875] DEBUG: skipping redundant vacuum to prevent wraparound of table "postgres.pg_catalog.pg_database"
Mar 25 14:48:26 cowtn postgres[39875]: [35298-1] 2020-03-25 14:48:26.329 GMT [39875] DEBUG: skipping redundant vacuum to prevent wraparound of table "postgres.pg_catalog.pg_tablespace"
Mar 25 14:48:26 cowtn postgres[39875]: [35299-1] 2020-03-25 14:48:26.339 GMT [39875] DEBUG: skipping redundant vacuum to prevent wraparound of table "postgres.pg_catalog.pg_auth_members"
Mar 25 14:48:26 cowtn postgres[39875]: [35300-1] 2020-03-25 14:48:26.350 GMT [39875] DEBUG: skipping redundant vacuum to prevent wraparound of table "postgres.pg_catalog.pg_replication_origin"
Mar 25 14:48:26 cowtn postgres[39875]: [35301-1] 2020-03-25 14:48:26.360 GMT [39875] DEBUG: skipping redundant vacuum to prevent wraparound of table "postgres.pg_catalog.pg_shseclabel"
Mar 25 14:48:26 cowtn postgres[39875]: [35302-1] 2020-03-25 14:48:26.361 GMT [39875] DEBUG: transaction ID wrap limit is 2137086523, limited by database with OID 13432
Mar 25 14:48:26 cowtn postgres[39875]: [35303-1] 2020-03-25 14:48:26.361 GMT [39875] DEBUG: MultiXactId wrap limit is 2147483648, limited by database with OID 13432
Mar 25 14:48:26 cowtn postgres[39875]: [35304-1] 2020-03-25 14:48:26.361 GMT [39875] DEBUG: MultiXact member stop limit is now 4294914944 based on MultiXact 1

postgres=# SELECT datname , oid , age(datfrozenxid) , current_setting('autovacuum_freeze_max_age') FROM pg_database;
  datname  |   oid    |    age    | current_setting
-----------+----------+-----------+-----------------
 postgres  |    13432 | 202588645 | 200000000
 template1 |        1 | 152588645 | 200000000
 template0 |    13431 |  87271781 | 200000000
 feedi     | 32178861 |  87271781 | 200000000

Here is the output from the "What is" queries referenced:

postgres=# SELECT datname, age(datfrozenxid), datfrozenxid FROM pg_database ORDER BY age(datfrozenxid) DESC LIMIT 1;
 datname  |    age    | datfrozenxid
----------+-----------+--------------
 postgres | 202773709 |   4284570172

postgres=# SELECT oid::regclass, age(relfrozenxid), relfrozenxid FROM pg_class WHERE relfrozenxid <> 0 ORDER BY age(relfrozenxid) DESC LIMIT 1;
    oid    |    age    | relfrozenxid
-----------+-----------+--------------
 pg_authid | 202793549 |   4284570172

postgres=# SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_activity WHERE backend_xmin <> 0 ORDER BY age(backend_xmin) DESC LIMIT 3;
  pid  | backend_xmin | age
-------+--------------+------
 17032 |    192412343 | 7199
 17033 |    192412343 | 7199
 52320 |    192419542 |    0
(3 rows)

postgres=# SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_replication WHERE backend_xmin <> 0 ORDER BY age(backend_xmin) DESC LIMIT 3;
  pid  | backend_xmin | age
-------+--------------+------
 17032 |    192434622 | 2004
 17033 |    192434622 | 2004
(2 rows)

postgres=# SELECT slot_name, xmin, age(xmin), catalog_xmin, age(catalog_xmin) FROM pg_replication_slots WHERE xmin <> 0 OR catalog_xmin <> 0 ORDER BY greatest(age(xmin), age(catalog_xmin)) DESC LIMIT 3;
 slot_name | xmin | age | catalog_xmin | age
-----------+------+-----+--------------+-----
(0 rows)

postgres=# SELECT gid, database, transaction FROM pg_prepared_xacts ORDER BY age(transaction) LIMIT 3;
 gid | database | transaction
-----+----------+-------------
(0 rows)

Let me know if there's anything else useful I can provide.

Thanks-

Justin

On Tue, Mar 24, 2020 at 2:43 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-03-24 15:12:38 +0900, Michael Paquier wrote:
> > > Well, there's no logging of autovacuum launchers that don't do anything
> > > due to the "skipping redundant" logic, with normal log level. If somehow
> > > the horizon logic of autovacuum workers gets out of whack with what
> > > vacuumlazy.c does, then you'd not get any vacuums. But a usage level
> > > triggered analyze could still happen on such a table, I think.
> >
> > What surprised me the most is that the same table happened to be
> > analyzed again and again after the launcher began its blackout.
>
> Well, if there's an issue with the "redundant" logic, that would be a
> not too surprising outcome. It's quite plausible that one or two tables
> in the database would get enough changes to occasionally need to be
> analyzed. If the workload is steady, that could e.g. work out to every
> ~17 minutes. All tables that autovacuum thinks are not wraparound
> threatened will be skipped, but ones that are will get both vacuum and
> analyze queued. The redundant logic could then entirely skip all
> vacuums - but there's no equivalent for analyze.
>
> > > While looking at this issue I found a few problems, btw. That seems more
> > > like a -hackers discussion, so I started:
> > > https://postgr.es/m/20200323235036.6pje6usrjjx22zv3%40alap3.anarazel.de
> >
> > Yes, let's discuss there.
>
> Cool. Would also be good if you could expand on the thread introducing
> the "redundant" logic.
>
> Greetings,
>
> Andres Freund
On Wed, Mar 25, 2020 at 10:39:17AM -0500, Justin King wrote:
> This started happening again. DEBUG1 is enabled:

Thanks for enabling DEBUG1 logs while this happened.

> Mar 25 14:48:26 cowtn postgres[39875]: [35298-1] 2020-03-25
> 14:48:26.329 GMT [39875] DEBUG: skipping redundant vacuum to prevent
> wraparound of table "postgres.pg_catalog.pg_tablespace"
> Mar 25 14:48:26 cowtn postgres[39875]: [35299-1] 2020-03-25
> 14:48:26.339 GMT [39875] DEBUG: skipping redundant vacuum to prevent
> wraparound of table "postgres.pg_catalog.pg_auth_members"
> Mar 25 14:48:26 cowtn postgres[39875]: [35300-1] 2020-03-25
> 14:48:26.350 GMT [39875] DEBUG: skipping redundant vacuum to prevent
> wraparound of table "postgres.pg_catalog.pg_replication_origin"

Are you seeing such log entries happening multiple times for the same
relations, meaning that autovacuum workers are just looping on the
same relations all over again? This part of the logs points to catalog
tables, but are there other tables within your system facing the same
logs, particularly the database "feedi" with some of your own tables?

> postgres=# SELECT oid::regclass, age(relfrozenxid), relfrozenxid FROM
> pg_class WHERE relfrozenxid <> 0 ORDER BY age(relfrozenxid) DESC LIMIT
> 1;
>     oid    |    age    | relfrozenxid
> -----------+-----------+--------------
>  pg_authid | 202793549 |   4284570172

Ugh. I think that this is exactly the thing I was suspecting
previously:
- The database stats look sane.
- The relation stats don't look good and visibly are put in such a
state that only one type of job gets triggered (non-aggressive but
anti-wraparound), which just keeps being skipped, so the relation stats
don't get refreshed. (Such autovacuum jobs should never happen and we
have some underlying issues that will need separate care.)

If you still have the cluster in this current state (perhaps you are
not able to keep it longer), could you provide more data about
pg_class.relfrozenxid for the tables which are mentioned in the logs
of the type "skipping redundant vacuum to prevent wraparound of table"?

> Let me know if there's anything else useful I can provide.

Thanks!
--
Michael
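Should the cluster get back into that state, one way to gather the
data requested here is a query along the following lines. This is a
sketch added for reference rather than something from the original
report; it simply restricts pg_class to the shared relations named in
the DEBUG output above:

    -- relfrozenxid/relminmxid for the shared catalogs being skipped
    SELECT oid::regclass AS relation,
           relfrozenxid,
           age(relfrozenxid) AS xid_age,
           relminmxid
    FROM pg_class
    WHERE relisshared
      AND relkind = 'r'
    ORDER BY age(relfrozenxid) DESC;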
Hi,

On 2020-03-26 10:43:36 +0900, Michael Paquier wrote:
> On Wed, Mar 25, 2020 at 10:39:17AM -0500, Justin King wrote:
> > Mar 25 14:48:26 cowtn postgres[39875]: [35298-1] 2020-03-25
> > 14:48:26.329 GMT [39875] DEBUG: skipping redundant vacuum to prevent
> > wraparound of table "postgres.pg_catalog.pg_tablespace"
> > Mar 25 14:48:26 cowtn postgres[39875]: [35299-1] 2020-03-25
> > 14:48:26.339 GMT [39875] DEBUG: skipping redundant vacuum to prevent
> > wraparound of table "postgres.pg_catalog.pg_auth_members"
> > Mar 25 14:48:26 cowtn postgres[39875]: [35300-1] 2020-03-25
> > 14:48:26.350 GMT [39875] DEBUG: skipping redundant vacuum to prevent
> > wraparound of table "postgres.pg_catalog.pg_replication_origin"

FWIW, this kind of thing is why I think the added skipping logic is a
bad idea. Silently skipping things like this (same with the "bogus"
logic in the datfrozenxid computation) is dangerous. I think we should
seriously consider backing this change out. And if not, then we should
at least include enough detail in the message to be able to debug
this.

> > postgres=# SELECT oid::regclass, age(relfrozenxid), relfrozenxid FROM
> > pg_class WHERE relfrozenxid <> 0 ORDER BY age(relfrozenxid) DESC LIMIT
> > 1;
> >     oid    |    age    | relfrozenxid
> > -----------+-----------+--------------
> >  pg_authid | 202793549 |   4284570172
>
> Ugh. I think that this is exactly the thing I was suspecting
> previously:
> - The database stats look sane.
> - The relation stats don't look good and visibly are put in such a
> state that only one type of job gets triggered (non-aggressive but
> anti-wraparound), which just keeps being skipped, so the relation stats
> don't get refreshed. (Such autovacuum jobs should never happen and we
> have some underlying issues that will need separate care.)

Hm. Why is this a state that is clearly invalid compared to
pg_database? It seems to precisely match:

> postgres=# SELECT datname, age(datfrozenxid), datfrozenxid FROM
> pg_database ORDER BY age(datfrozenxid) DESC LIMIT 1;
>  datname  |    age    | datfrozenxid
> ----------+-----------+--------------
>  postgres | 202773709 |   4284570172

And why should this lead to anti-wraparound vacuums not happening? This
is older than the cutoff age?

xid 4284570172 having the age of 202 million xids suggests that
ReadNewTransactionId() is approx 192376585. Which comports with the log
saying: oldest xmin: 189591147.

Or are you saying that you conclude that the relcache entry is somehow
out of date? It sure is interesting that all of the tables that hit the
"skipping redundant vacuum" condition are shared tables.

Greetings,

Andres Freund
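To spell out the arithmetic behind that estimate (a back-of-the-envelope
check, not part of the original message): transaction IDs wrap at 2^32,
so the approximate current xid can be recovered from a frozen xid plus
its reported age.

    -- relfrozenxid plus its age, modulo the 32-bit xid space
    SELECT (4284570172::bigint + 202773709) % 4294967296 AS approx_next_xid;
    -- returns 192376585, consistent with "oldest xmin: 189591147" in the logs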
On Wed, Mar 25, 2020 at 07:59:56PM -0700, Andres Freund wrote:
> FWIW, this kind of thing is why I think the added skipping logic is a
> bad idea. Silently skipping things like this (same with the "bogus"
> logic in the datfrozenxid computation) is dangerous. I think we should
> seriously consider backing this change out.

That's actually what I would like to do at this stage as a first
step. It looks pretty clear that it does not help.

> And if not, then we should at least include enough detail in the message
> to be able to debug this.

Sure. In any attempt I have done until now I was easily able to skip
some jobs, but it should get easier with a higher number of concurrent
workers and a higher number of relations heavily updated. Thinking
about it, only catalog jobs were getting skipped in my own runs...

>> postgres=# SELECT datname, age(datfrozenxid), datfrozenxid FROM
>> pg_database ORDER BY age(datfrozenxid) DESC LIMIT 1;
>>  datname  |    age    | datfrozenxid
>> ----------+-----------+--------------
>>  postgres | 202773709 |   4284570172
>
> And why should this lead to anti-wraparound vacuums not happening? This
> is older than the cutoff age?
>
> xid 4284570172 having the age of 202 million xids suggests that
> ReadNewTransactionId() is approx 192376585. Which comports with the log
> saying: oldest xmin: 189591147.

Oops, sorry. My previous email was incorrect. It looked strange to
not see datfrozenxid being refreshed.

> Or are you saying that you conclude that the relcache entry is somehow
> out of date? It sure is interesting that all of the tables that hit the
> "skipping redundant vacuum" condition are shared tables.

Yeah, that's actually what I was thinking yesterday. In
heap_vacuum_rel(), xidFullScanLimit may be calculated right, but an
incorrect value of rd_rel->relminmxid or rd_rel->relfrozenxid could
lead to a job becoming non-aggressive. It should be actually easy
enough to check that.
--
Michael
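The relcache copy (rd_rel->relfrozenxid) mentioned here is not visible
from SQL, but the on-disk catalog values it is supposed to match are.
As a rough cross-check, offered only as a sketch, one can compare each
shared relation's age against the threshold that makes a vacuum
aggressive (vacuum_freeze_table_age, which the server caps at 95% of
autovacuum_freeze_max_age):

    -- shared relations whose catalog state says an aggressive vacuum is due
    SELECT oid::regclass AS relation,
           age(relfrozenxid) AS xid_age,
           age(relfrozenxid) >
             least(current_setting('vacuum_freeze_table_age')::int,
                   (0.95 * current_setting('autovacuum_freeze_max_age')::int)::int)
             AS should_be_aggressive
    FROM pg_class
    WHERE relisshared AND relkind = 'r'
    ORDER BY xid_age DESC;

A relation reported as should_be_aggressive here while the server keeps
launching non-aggressive anti-wraparound jobs for it would point at the
stale-relcache scenario being discussed.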
On Wed, Mar 25, 2020 at 8:43 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, Mar 25, 2020 at 10:39:17AM -0500, Justin King wrote:
> > This started happening again. DEBUG1 is enabled:
>
> Thanks for enabling DEBUG1 logs while this happened.
>
> > Mar 25 14:48:26 cowtn postgres[39875]: [35298-1] 2020-03-25
> > 14:48:26.329 GMT [39875] DEBUG: skipping redundant vacuum to prevent
> > wraparound of table "postgres.pg_catalog.pg_tablespace"
> > Mar 25 14:48:26 cowtn postgres[39875]: [35299-1] 2020-03-25
> > 14:48:26.339 GMT [39875] DEBUG: skipping redundant vacuum to prevent
> > wraparound of table "postgres.pg_catalog.pg_auth_members"
> > Mar 25 14:48:26 cowtn postgres[39875]: [35300-1] 2020-03-25
> > 14:48:26.350 GMT [39875] DEBUG: skipping redundant vacuum to prevent
> > wraparound of table "postgres.pg_catalog.pg_replication_origin"
>
> Are you seeing such log entries happening multiple times for the same
> relations, meaning that autovacuum workers are just looping on the
> same relations all over again? This part of the logs points to catalog
> tables, but are there other tables within your system facing the same
> logs, particularly the database "feedi" with some of your own tables?

Nope, it was just these tables that were looping over and over while
nothing else was getting autovac'd. I'm happy to share the full log
if you'd like.

> > postgres=# SELECT oid::regclass, age(relfrozenxid), relfrozenxid FROM
> > pg_class WHERE relfrozenxid <> 0 ORDER BY age(relfrozenxid) DESC LIMIT
> > 1;
> >     oid    |    age    | relfrozenxid
> > -----------+-----------+--------------
> >  pg_authid | 202793549 |   4284570172
>
> Ugh. I think that this is exactly the thing I was suspecting
> previously:
> - The database stats look sane.
> - The relation stats don't look good and visibly are put in such a
> state that only one type of job gets triggered (non-aggressive but
> anti-wraparound), which just keeps being skipped, so the relation stats
> don't get refreshed. (Such autovacuum jobs should never happen and we
> have some underlying issues that will need separate care.)
>
> If you still have the cluster in this current state (perhaps you are
> not able to keep it longer), could you provide more data about
> pg_class.relfrozenxid for the tables which are mentioned in the logs
> of the type "skipping redundant vacuum to prevent wraparound of table"?

I did have to remove it from this state, but I can undo my workaround
and, undoubtedly, it'll end up back there. Let me know if there's
something specific you'd like me to provide when it happens!

> > Let me know if there's anything else useful I can provide.
>
> Thanks!
> --
> Michael
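The exact manual workaround is not spelled out in this message. Since
what unblocks the looping workers is advancing relfrozenxid on the
affected catalogs, one hypothetical shape for such an intervention (an
assumption for illustration, not a command quoted from the poster)
would be a manual freeze run as a superuser in the affected database:

    -- freeze a single stuck shared catalog
    VACUUM (FREEZE, VERBOSE) pg_catalog.pg_authid;
    -- or freeze every table in the current database
    VACUUM FREEZE;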
On Thu, Mar 26, 2020 at 09:46:47AM -0500, Justin King wrote:
> Nope, it was just these tables that were looping over and over while
> nothing else was getting autovac'd. I'm happy to share the full log
> if you'd like.

Thanks, that could help. If that's very large, it could be a problem
to send that to the lists, but you could send me directly a link to
it and I'll try to extract more information for the lists. While
testing for reproducing the issue, I have noticed that basically one
set of catalog tables happened to see this "skipping redundant" log.
And I am wondering if we have a match with the set of catalog tables
looping.

> I did have to remove it from this state, but I can undo my workaround
> and, undoubtedly, it'll end up back there. Let me know if there's
> something specific you'd like me to provide when it happens!

For now I think it's fine. Note that Julien and I have an environment
where the issue can be reproduced easily (it takes roughly 12 hours
until the wraparound cutoffs are reached with the benchmark and
settings used), and we are checking things using a patched instance
with 2aa6e33 reverted. I think that we are accumulating enough
evidence that this change was not a good idea anyway thanks to the
information you sent, so likely we'll finish first by a revert of
2aa6e33 from the master and REL_12_STABLE branches, before looking at
the issues with the catalogs for those anti-wraparound and
non-aggressive jobs (this looks like a relcache issue with said
catalogs).
--
Michael
On Fri, Mar 27, 2020 at 02:12:04PM +0900, Michael Paquier wrote:
> On Thu, Mar 26, 2020 at 09:46:47AM -0500, Justin King wrote:
> > Nope, it was just these tables that were looping over and over while
> > nothing else was getting autovac'd. I'm happy to share the full log
> > if you'd like.
>
> Thanks, that could help. If that's very large, it could be a problem
> to send that to the lists, but you could send me directly a link to
> it and I'll try to extract more information for the lists. While
> testing for reproducing the issue, I have noticed that basically one
> set of catalog tables happened to see this "skipping redundant" log.
> And I am wondering if we have a match with the set of catalog tables
> looping.
>
> > I did have to remove it from this state, but I can undo my workaround
> > and, undoubtedly, it'll end up back there. Let me know if there's
> > something specific you'd like me to provide when it happens!
>
> For now I think it's fine. Note that Julien and I have an environment
> where the issue can be reproduced easily (it takes roughly 12 hours
> until the wraparound cutoffs are reached with the benchmark and
> settings used), and we are checking things using a patched instance
> with 2aa6e33 reverted. I think that we are accumulating enough
> evidence that this change was not a good idea anyway thanks to the
> information you sent, so likely we'll finish first by a revert of
> 2aa6e33 from the master and REL_12_STABLE branches, before looking at
> the issues with the catalogs for those anti-wraparound and
> non-aggressive jobs (this looks like a relcache issue with said
> catalogs).

FTR we reached the 200M transaction mark earlier, and I can see
multiple logs of the form "automatic vacuum to prevent wraparound",
i.e. non-aggressive anti-wraparound autovacuum, all on shared
relations. As those vacuums weren't skipped, autovacuum didn't get
stuck in a loop on them and continued its work normally. This
happened ~4h ago and hasn't occurred again, even though the 200M
threshold has been reached multiple times since.
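For readers reproducing this, the timing of those anti-wraparound runs
on shared relations can also be observed from the statistics views.
The following is an illustrative sketch (not a query used in the
thread), listing the shared catalogs with their last autovacuum time
and counter:

    -- shared catalogs and when autovacuum last processed them
    SELECT c.oid::regclass AS relation,
           s.last_autovacuum,
           s.autovacuum_count
    FROM pg_class c
    LEFT JOIN pg_stat_all_tables s ON s.relid = c.oid
    WHERE c.relisshared AND c.relkind = 'r'
    ORDER BY s.last_autovacuum DESC NULLS LAST;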
On Fri, Mar 27, 2020 at 12:12 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Mar 26, 2020 at 09:46:47AM -0500, Justin King wrote:
> > Nope, it was just these tables that were looping over and over while
> > nothing else was getting autovac'd. I'm happy to share the full log
> > if you'd like.
>
> Thanks, that could help. If that's very large, it could be a problem
> to send that to the lists, but you could send me directly a link to
> it and I'll try to extract more information for the lists. While
> testing for reproducing the issue, I have noticed that basically one
> set of catalog tables happened to see this "skipping redundant" log.
> And I am wondering if we have a match with the set of catalog tables
> looping.

Sounds great. I will email you directly with a link!

> > I did have to remove it from this state, but I can undo my workaround
> > and, undoubtedly, it'll end up back there. Let me know if there's
> > something specific you'd like me to provide when it happens!
>
> For now I think it's fine. Note that Julien and I have an environment
> where the issue can be reproduced easily (it takes roughly 12 hours
> until the wraparound cutoffs are reached with the benchmark and
> settings used), and we are checking things using a patched instance
> with 2aa6e33 reverted. I think that we are accumulating enough
> evidence that this change was not a good idea anyway thanks to the
> information you sent, so likely we'll finish first by a revert of
> 2aa6e33 from the master and REL_12_STABLE branches, before looking at
> the issues with the catalogs for those anti-wraparound and
> non-aggressive jobs (this looks like a relcache issue with said
> catalogs).

This is encouraging. As I mentioned, we have a workaround in place for
the moment, but don't hesitate if you need anything else from me.
Thanks for jumping in on the thread, it was nice validation to know
that I wasn't the only one seeing the issue!

> --
> Michael
On Fri, Mar 27, 2020 at 05:10:03PM -0500, Justin King wrote:
> Sounds great. I will email you directly with a link!

Thanks. From the logs, the infinite loop on which autovacuum jobs are
stuck is clear. We have a repetitive set of anti-wraparound and
non-aggressive jobs happening for 7 shared relations every 5 seconds
or so, and autovacuum is not able to do anything else:
pg_auth_members
pg_authid
pg_database
pg_replication_origin
pg_shseclabel
pg_subscription
pg_tablespace

However I have found a correlation between all the tests I have done
on my own laptop, all the tests Julien and I have done in our internal
company buildfarm, and your logs. We have 10 shared catalogs in the
system and the problem does not happen with the three following ones,
which never triggered an anti-wraparound, non-aggressive job:
pg_db_role_setting
pg_shdepend
pg_shdescription
Not sure yet what that is, but it is hard to believe that it is just a
coincidence.

> This is encouraging. As I mentioned, we have a workaround in place for
> the moment, but don't hesitate if you need anything else from me.
> Thanks for jumping in on the thread, it was nice validation to know
> that I wasn't the only one seeing the issue!

Yes. We have been able to confirm that 2aa6e33 is a direct cause of
your problem.
--
Michael
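The ten shared catalogs referred to above can be listed directly; the
query below is a small reference sketch using the relisshared flag in
pg_class (relkind = 'r' keeps shared indexes and toast tables out of
the list):

    -- plain shared relations in the cluster
    SELECT oid::regclass AS relation
    FROM pg_class
    WHERE relisshared AND relkind = 'r'
    ORDER BY relname;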
On Fri, Mar 27, 2020 at 08:23:03PM +0100, Julien Rouhaud wrote:
> FTR we reached the 200M transaction mark earlier, and I can see
> multiple logs of the form "automatic vacuum to prevent wraparound",
> i.e. non-aggressive anti-wraparound autovacuum, all on shared
> relations.

Thanks Julien for sharing the information with the community. Note
that three shared relations did not run such jobs though.

> As those vacuums weren't skipped, autovacuum didn't get stuck in a
> loop on them and continued its work normally. This happened ~4h ago
> and hasn't occurred again, even though the 200M threshold has been
> reached multiple times since.

As we have confirmed that 2aa6e33 is actually the origin of the
problem, I am planning to revert it at the beginning of next week, and
then dig more into the origin of those anti-wraparound and
non-aggressive jobs happening for a portion of the shared relations.
Andres has started a thread about the problem on -hackers here:
https://www.postgresql.org/message-id/20200323235036.6pje6usrjjx22zv3@alap3.anarazel.de
And I'll follow up there with anything new I find. Please let me know
if there are any objections to the revert, which will address the
problem reported by Justin.
--
Michael
Hi,

On 2020-03-28 17:47:19 +0900, Michael Paquier wrote:
> On Fri, Mar 27, 2020 at 05:10:03PM -0500, Justin King wrote:
> > This is encouraging. As I mentioned, we have a workaround in place for
> > the moment, but don't hesitate if you need anything else from me.
> > Thanks for jumping in on the thread, it was nice validation to know
> > that I wasn't the only one seeing the issue!
>
> Yes. We have been able to confirm that 2aa6e33 is a direct cause of
> your problem.

I assume you're still trying to track the actual cause of the problem
further?

Greetings,

Andres Freund
On Sat, Mar 28, 2020 at 11:29:41AM -0700, Andres Freund wrote:
> I assume you're still trying to track the actual cause of the problem
> further?

That's the plan, and I'll try to spend some time on it next week. Any
new information I have will be added to the thread you began on
-hackers a couple of days ago.
--
Michael
On Sat, Mar 28, 2020 at 05:53:59PM +0900, Michael Paquier wrote:
> And I'll follow up there with anything new I find. Please let me know
> if there are any objections to the revert, which will address the
> problem reported by Justin.

Okay. Done with this part now as of dd9ac7d. Now for the older
issue...
--
Michael