Re: PG12 autovac issues - Mailing list pgsql-general
From | Andres Freund |
---|---|
Subject | Re: PG12 autovac issues |
Date | |
Msg-id | 20200323162303.s7ay5hjdvimtkz6u@alap3.anarazel.de Whole thread Raw |
In response to | Re: PG12 autovac issues (Julien Rouhaud <rjuju123@gmail.com>) |
Responses |
Re: PG12 autovac issues
|
List | pgsql-general |
Hi, On 2020-03-23 16:22:47 +0100, Julien Rouhaud wrote: > On Fri, Mar 20, 2020 at 12:03:17PM -0700, Andres Freund wrote: > > Hi, > > > > On 2020-03-20 12:42:31 -0500, Justin King wrote: > > > When we get into this state again, is there some other information > > > (other than what is in pg_stat_statement or pg_stat_activity) that > > > would be useful for folks here to help understand what is going on? > > > > If it's actually stuck on a single table, and that table is not large, > > it would be useful to get a backtrace with gdb. > > FTR, we're facing a very similar issue at work (adding Michael and Kevin in Cc) > during performance tests since a recent upgrade to pg12 . > > What seems to be happening is that after reaching 200M transaction a first pass > of autovacuum freeze is being run, bumping pg_database.darfrozenxid by ~ 800k > (age(datfrozenxid) still being more than autovacuum_freeze_max_age > afterwards). If you have older transactions around that'd not be surprising. Do you have autovacuum logging output for this case? > After that point, all available information seems to indicate that no > autovacuum worker is scheduled anymore: Do you mean that this table doesn't get autovac'ed at all anymore, that no table is getting autovac'd, or just that there's nothing further increasing relfrozenxid for that table? It sounds like: > - log_autovacuum_min_duration is set to 0 and no activity is logged (while > having thousands of those per hour before that) no table at all? > - 15 min interval snapshot of pg_database and pg_class shows that > datfrozenxid/relfrozenxid keeps increasing at a consistent rate and never > goes down I assume you mean their age? What is: - datfrozenxid, age(datfrozenxid) for the oldest database SELECT datname, age(datfrozenxid), datfrozenxid FROM pg_database ORDER BY age(datfrozenxid) DESC LIMIT 1; - relfrozenxid, age(relfrozenxid) for the oldest table in the oldest database SELECT oid::regclass, age(relfrozenxid), relfrozenxid FROM pg_class WHERE relfrozenxid <> 0 ORDER BY age(relfrozenxid)DESC LIMIT 1; - Oldest backend xmin SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_activity WHERE backend_xmin <> 0 ORDER BY age(backend_xmin) DESCLIMIT 3; - oldest replication xmin: SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_replication WHERE backend_xmin <> 0 ORDER BY age(backend_xmin)DESC LIMIT 3; - oldest slot xmin: SELECT slot_name, xmin, age(xmin), catalog_xmin, age(catalog_xmin) FROM pg_replication_slots WHERE xmin <> 0 OR catalog_xmin<> 0 ORDER BY greatest(age(xmin), age(catalog_xmin)) DESC LIMIT 3; - oldest prepared transaction (unfortunately xmin not available) SELECT gid, database, transaction FROM pg_prepared_xacts ORDER BY age(transaction) LIMIT 3; > The same bench was run against pg11 many times and never triggered this issue. > So far our best guess is a side effect of 2aa6e331ead7. In that case you'd likely see DEBUG1 output, right? Did you try running with that? > Michael and I have been trying to reproduce this issue locally (drastically > reducing the various freeze_age parameters) for hours, but no luck for now. Have you considered using gdb to investigate? > This is using a vanilla pg 12.1, with some OLTP workload. The only possibly > relevant configuration changes are quite aggressive autovacuum settings on some > tables (no delay, analyze/vacuum threshold to 1k and analyze/vacuum scale > factor to 0, for both heap and toast). That, uh, is not an insignificant set of changes for an autovac scheduling issues. It sounds like what might be happening is that you have something holding back the "oldest needed transaction" horizon. Before 2aa6e331ead7, if the xmin horizon hasn't increased, there'd be repeated vacuums unable to increase the horizon, but afterwards they'll all immediately exit without any messages. I wonder if what might be happening is that we're somehow missed/failed to update relfrozenxid and/or datfrozenxid. If you manually vacuum some table in the oldest database, but that is *NOT* the oldest table itself, does the problem "resolve" itself? Greetings, Andres Freund
pgsql-general by date: