Re: PG12 autovac issues - Mailing list pgsql-general

From Andres Freund
Subject Re: PG12 autovac issues
Date
Msg-id 20200323162303.s7ay5hjdvimtkz6u@alap3.anarazel.de
Whole thread Raw
In response to Re: PG12 autovac issues  (Julien Rouhaud <rjuju123@gmail.com>)
Responses Re: PG12 autovac issues
List pgsql-general
Hi,

On 2020-03-23 16:22:47 +0100, Julien Rouhaud wrote:
> On Fri, Mar 20, 2020 at 12:03:17PM -0700, Andres Freund wrote:
> > Hi,
> > 
> > On 2020-03-20 12:42:31 -0500, Justin King wrote:
> > > When we get into this state again, is there some other information
> > > (other than what is in pg_stat_statement or pg_stat_activity) that
> > > would be useful for folks here to help understand what is going on?
> > 
> > If it's actually stuck on a single table, and that table is not large,
> > it would be useful to get a backtrace with gdb.
> 
> FTR, we're facing a very similar issue at work (adding Michael and Kevin in Cc)
> during performance tests since a recent upgrade to pg12 .
>
> What seems to be happening is that after reaching 200M transaction a first pass
> of autovacuum freeze is being run, bumping pg_database.darfrozenxid by ~ 800k
> (age(datfrozenxid) still being more than autovacuum_freeze_max_age
> afterwards).

If you have older transactions around that'd not be surprising. Do you
have autovacuum logging output for this case?


> After that point, all available information seems to indicate that no
> autovacuum worker is scheduled anymore:

Do you mean that this table doesn't get autovac'ed at all anymore, that
no table is getting autovac'd, or just that there's nothing further
increasing relfrozenxid for that table?

It sounds like:

> - log_autovacuum_min_duration is set to 0 and no activity is logged (while
>   having thousands of those per hour before that)

no table at all?


> - 15 min interval snapshot of pg_database and pg_class shows that
>   datfrozenxid/relfrozenxid keeps increasing at a consistent rate and never
>   goes down

I assume you mean their age?

What is:
- datfrozenxid, age(datfrozenxid) for the oldest database
  SELECT datname, age(datfrozenxid), datfrozenxid FROM pg_database ORDER BY age(datfrozenxid) DESC LIMIT 1;
- relfrozenxid, age(relfrozenxid) for the oldest table in the oldest database
  SELECT oid::regclass, age(relfrozenxid), relfrozenxid FROM pg_class WHERE relfrozenxid <> 0 ORDER BY
age(relfrozenxid)DESC LIMIT 1;
 
- Oldest backend xmin
  SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_activity WHERE backend_xmin <> 0 ORDER BY age(backend_xmin)
DESCLIMIT 3;
 
- oldest replication xmin:
  SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_replication WHERE backend_xmin <> 0 ORDER BY
age(backend_xmin)DESC LIMIT 3;
 
- oldest slot xmin:
  SELECT slot_name, xmin, age(xmin), catalog_xmin, age(catalog_xmin) FROM pg_replication_slots WHERE xmin <> 0 OR
catalog_xmin<> 0 ORDER BY greatest(age(xmin), age(catalog_xmin)) DESC LIMIT 3;
 
- oldest prepared transaction (unfortunately xmin not available)
  SELECT gid, database, transaction FROM pg_prepared_xacts ORDER BY age(transaction) LIMIT 3;


> The same bench was run against pg11 many times and never triggered this issue.
> So far our best guess is a side effect of 2aa6e331ead7.

In that case you'd likely see DEBUG1 output, right? Did you try running
with that?


> Michael and I have been trying to reproduce this issue locally (drastically
> reducing the various freeze_age parameters) for hours, but no luck for now.

Have you considered using gdb to investigate?


> This is using a vanilla pg 12.1, with some OLTP workload.  The only possibly
> relevant configuration changes are quite aggressive autovacuum settings on some
> tables (no delay, analyze/vacuum threshold to 1k and analyze/vacuum scale
> factor to 0, for both heap and toast).

That, uh, is not an insignificant set of changes for an autovac
scheduling issues.

It sounds like what might be happening is that you have something
holding back the "oldest needed transaction" horizon. Before
2aa6e331ead7, if the xmin horizon hasn't increased, there'd be repeated
vacuums unable to increase the horizon, but afterwards they'll all
immediately exit without any messages.

I wonder if what might be happening is that we're somehow missed/failed
to update relfrozenxid and/or datfrozenxid. If you manually vacuum some
table in the oldest database, but that is *NOT* the oldest table itself,
does the problem "resolve" itself?

Greetings,

Andres Freund



pgsql-general by date:

Previous
From: Michael Lewis
Date:
Subject: Re: Runtime partition pruning
Next
From: Kouber Saparev
Date:
Subject: Re: Is it safe to rename an index through pg_class update?