Re: pgsql: Improve autovacuum logging for aggressive andanti-wraparound ru - Mailing list pgsql-hackers

From Kyotaro HORIGUCHI
Subject Re: pgsql: Improve autovacuum logging for aggressive andanti-wraparound ru
Date
Msg-id 20181009.181536.142257785.horiguchi.kyotaro@lab.ntt.co.jp
Whole thread Raw
In response to Re: pgsql: Improve autovacuum logging for aggressive andanti-wraparound ru  (Michael Paquier <michael@paquier.xyz>)
Responses Re: pgsql: Improve autovacuum logging for aggressive andanti-wraparound ru  (Andrew Dunstan <andrew.dunstan@2ndquadrant.com>)
List pgsql-hackers
At Fri, 5 Oct 2018 15:35:04 +0900, Michael Paquier <michael@paquier.xyz> wrote in <20181005063504.GB14664@paquier.xyz>
> On Fri, Oct 05, 2018 at 12:16:03PM +0900, Michael Paquier wrote:
> > So, I have come back to this stuff, and finished with the attached
> > instead, so as the assertion is in a single place.  I find that
> > clearer.  The comments have also been improved.  Thoughts?
> 
> And so...  I have been looking at committing this thing, and while
> testing in-depth I have been able to trigger a case where an autovacuum
> has been able to be not aggressive but anti-wraparound, which is exactly
> what should not be possible, no?  I have simply created an instance with
> autovacuum_freeze_max_age = 200000, then ran pgbench with
> autovacuum_freeze_table_age=200000 set for each table, and also ran
> installcheck-world in parallel.  This has been able to trigger the
> assertion pretty quickly.

I investigated it and in short, it can happen.

It is a kind of race consdition between two autovacuum
processes. do_autovacuum() looks into pg_class (using a snapshot)
and vacuum_set_xid_limits() looks into relcache. If concurrent
vacuum happens and one has finished the relation, another gets
relcache invalidation and relfrozenxid is updated. If this
happens between do_autovacuum() and vacuum_set_xid_limits(), the
latter sees newer relfrozenxid than the former. The problem
happens when it moves by more than 5% of
autovacuum_freeze_max_age.

If lazy_vacuum_rel() sees the situation, the relation is already
aggressively vacuumed by a cocurrent worker. We can just ingore
the state safely but also we know that the vacuum is useless.

1. Just allow the case there (and add comment).
   Causes redundant anti-wraparound vacuum.

2. Skip the relation by the condition.

   I think that we can safely skip the relation in the
   case. (attached)

3. Ensure that do_autovacuum always sees the same relfrozenxid
   with vacuum_set_xid_limits().

4. Prevent concurrent acuuming of the same relation rigorously,
  somehow.

Thoughts?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 8996d366e9..436d573454 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -249,6 +249,21 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
     if (options & VACOPT_DISABLE_PAGE_SKIPPING)
         aggressive = true;
 
+    /*
+     * When is_wraparound, relfrozenxid is old enough and aggressive must be
+     * true here. However, if another concurrent vacuum process has processed
+     * this relation, relfrozenxid in relcache can be moved forward enough so
+     * that aggressive is calculated as false here. This relation no longer
+     * needs to be vacuumed. Bail out.
+     */
+    if (params->is_wraparound && !aggressive)
+    {
+        elog(DEBUG1, "relation %d has been vacuumd ocncurrently, skip",
+            onerel->rd_id);
+        return;
+    }
+
+
     vacrelstats = (LVRelStats *) palloc0(sizeof(LVRelStats));
 
     vacrelstats->old_rel_pages = onerel->rd_rel->relpages;

pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: partition tree inspection functions
Next
From: Amit Langote
Date:
Subject: Re: partition tree inspection functions