Thread: Re: pgsql: Improve autovacuum logging for aggressive andanti-wraparound ru

Re: pgsql: Improve autovacuum logging for aggressive andanti-wraparound ru

From
Alvaro Herrera
Date:
On 2018-Sep-13, Michael Paquier wrote:

> Improve autovacuum logging for aggressive and anti-wraparound runs
> 
> A log message was being generated when log_min_duration is reached for
> autovacuum on a given relation to indicate if it was an aggressive run,
> and missed the point of mentioning if it is doing an anti-wrapround
> run.  The log message generated is improved so as one, both or no extra
> details are added depending on the option set.

Hmm, can a for-wraparound vacuum really not be aggressive?  I think one
of those four cases is really dead code.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: pgsql: Improve autovacuum logging for aggressive andanti-wraparound ru

From
Michael Paquier
Date:
On Fri, Sep 14, 2018 at 12:35:54PM -0300, Alvaro Herrera wrote:
> On 2018-Sep-13, Michael Paquier wrote:
>> Improve autovacuum logging for aggressive and anti-wraparound runs
>>
>> A log message was being generated when log_min_duration is reached for
>> autovacuum on a given relation to indicate if it was an aggressive run,
>> and missed the point of mentioning if it is doing an anti-wrapround
>> run.  The log message generated is improved so as one, both or no extra
>> details are added depending on the option set.
>
> Hmm, can a for-wraparound vacuum really not be aggressive?  I think one
> of those four cases is really dead code.

Sure, at the same time it is a no-brainer to keep the code as is, and
seeing log messages where (!aggressive && wraparound) would be an
indication of surrounding bugs, no?  There have been issues in this area
in the past.
--
Michael

Attachment

Re: pgsql: Improve autovacuum logging for aggressive andanti-wraparound ru

From
Robert Haas
Date:
On Fri, Sep 14, 2018 at 11:35 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> On 2018-Sep-13, Michael Paquier wrote:
>> Improve autovacuum logging for aggressive and anti-wraparound runs
>>
>> A log message was being generated when log_min_duration is reached for
>> autovacuum on a given relation to indicate if it was an aggressive run,
>> and missed the point of mentioning if it is doing an anti-wrapround
>> run.  The log message generated is improved so as one, both or no extra
>> details are added depending on the option set.
>
> Hmm, can a for-wraparound vacuum really not be aggressive?  I think one
> of those four cases is really dead code.

My first question was whether TWO of them were dead code ... isn't an
aggressive vacuum to prevent wraparound, and a vacuum to prevent
wraparound aggressive?

I can't figure out what this is giving us that we didn't have before.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: pgsql: Improve autovacuum logging for aggressive and anti-wraparound ru

From
Sergei Kornilov
Date:
Hello, Robert

> My first question was whether TWO of them were dead code ... isn't an
> aggressive vacuum to prevent wraparound, and a vacuum to prevent
> wraparound aggressive?
Maybe i am wrong, aggressive autovacuum was your commit.
Message split was in b55509332f50f998b6e8b3830a51c5b9d8f666aa
Aggressive autovacuum was in fd31cd265138019dcccc9b5fe53043670898bc9f

If aggressive really is wraparound without difference - i think we need refactor this code, it is difficult have two
differentflags for same purpose.
 

But as far i can see it is possible have aggressive non-wraparound vacuum. One important difference - regular and
aggressiveregular can be canceled by backend,.wraparound autovacuum can not. (by checking PROC_VACUUM_FOR_WRAPAROUND in
src/backend/storage/lmgr/proc.c)
 

regards, Sergei


Re: pgsql: Improve autovacuum logging for aggressive andanti-wraparound ru

From
Andres Freund
Date:
Hi,

On 2018-09-21 20:38:16 +0300, Sergei Kornilov wrote:
> > My first question was whether TWO of them were dead code ... isn't an
> > aggressive vacuum to prevent wraparound, and a vacuum to prevent
> > wraparound aggressive?
> Maybe i am wrong, aggressive autovacuum was your commit.
> Message split was in b55509332f50f998b6e8b3830a51c5b9d8f666aa
> Aggressive autovacuum was in fd31cd265138019dcccc9b5fe53043670898bc9f
> 
> If aggressive really is wraparound without difference - i think we need refactor this code, it is difficult have two
differentflags for same purpose.
 
> 
> But as far i can see it is possible have aggressive non-wraparound vacuum. One important difference - regular and
aggressiveregular can be canceled by backend,.wraparound autovacuum can not. (by checking PROC_VACUUM_FOR_WRAPAROUND in
src/backend/storage/lmgr/proc.c)
 

Yes, without checking the code, they should be different. Aggressive is
controlled by vacuum_freeze_table_age whereas anti-wrap is controlled by
autovacuum_freeze_max_age (but also implies aggressive).

Greetings,

Andres Freund


Re: pgsql: Improve autovacuum logging for aggressive andanti-wraparound ru

From
"Nasby, Jim"
Date:
> On Sep 21, 2018, at 12:43 PM, Andres Freund <andres@anarazel.de> wrote:
> 
>> But as far i can see it is possible have aggressive non-wraparound vacuum. One important difference - regular and
aggressiveregular can be canceled by backend,.wraparound autovacuum can not. (by checking PROC_VACUUM_FOR_WRAPAROUND in
src/backend/storage/lmgr/proc.c)
 
> 
> Yes, without checking the code, they should be different. Aggressive is
> controlled by vacuum_freeze_table_age whereas anti-wrap is controlled by
> autovacuum_freeze_max_age (but also implies aggressive).

Right, except that by the time you get into the vacuum code itself nothing should really care about that difference.
AFAICT,the only thing is_wraparound is being used for is to set MyPgXact->vacuumFlags |= PROC_VACUUM_FOR_WRAPAROUND,
whichprevents the deadlock detector from killing an autovac process that’s trying to prevent a wraparound. I think it’d
beclearer to remove is_wraparound and move the check from vacuum_rel() into lazy_vacuum_rel() (which is where the
limitsfor HeapTupleSatisfiesVacuum get determined). Something like the attached. 
Attachment

Re: pgsql: Improve autovacuum logging for aggressive andanti-wraparound ru

From
Andres Freund
Date:
On 2018-09-24 18:25:46 +0000, Nasby, Jim wrote:
> 
> > On Sep 21, 2018, at 12:43 PM, Andres Freund <andres@anarazel.de> wrote:
> > 
> >> But as far i can see it is possible have aggressive non-wraparound vacuum. One important difference - regular and
aggressiveregular can be canceled by backend,.wraparound autovacuum can not. (by checking PROC_VACUUM_FOR_WRAPAROUND in
src/backend/storage/lmgr/proc.c)
 
> > 
> > Yes, without checking the code, they should be different. Aggressive is
> > controlled by vacuum_freeze_table_age whereas anti-wrap is controlled by
> > autovacuum_freeze_max_age (but also implies aggressive).
> 
> Right, except that by the time you get into the vacuum code itself nothing should really care about that difference.
AFAICT,the only thing is_wraparound is being used for is to set MyPgXact->vacuumFlags |= PROC_VACUUM_FOR_WRAPAROUND,
whichprevents the deadlock detector from killing an autovac process that’s trying to prevent a wraparound. I think it’d
beclearer to remove is_wraparound and move the check from vacuum_rel() into lazy_vacuum_rel() (which is where the
limitsfor HeapTupleSatisfiesVacuum get determined). Something like the attached.
 

I'm very doubtful this is an improvement. Especially with the upcoming
pluggable storage work making vacuumlazy.c heap specific, while vacuum.c
stays generic.  The concept of something like
PROC_VACUUM_FOR_WRAPAROUND, should imo not be pushed down that much
(even if criteria for it might).

Greetings,

Andres Freund


Re: pgsql: Improve autovacuum logging for aggressive andanti-wraparound ru

From
"Nasby, Jim"
Date:
> On Sep 24, 2018, at 1:29 PM, Andres Freund <andres@anarazel.de> wrote:
> 
> I'm very doubtful this is an improvement. Especially with the upcoming
> pluggable storage work making vacuumlazy.c heap specific, while vacuum.c
> stays generic.  The concept of something like
> PROC_VACUUM_FOR_WRAPAROUND, should imo not be pushed down that much
> (even if criteria for it might).

That’s already a problem since vacuum logging is spread all over while autovac logging is not. Perhaps there needs to
besome sort of vacuum_log() function that immediately provides output for manual vacuums, but aggregates output for
autovac.AFAIK that’s the only real reason for autovac logging being a special case today. 

Re: pgsql: Improve autovacuum logging for aggressive and anti-wraparound ru

From
Sergei Kornilov
Date:
Hi

> An autovacuum can't be just aggressive; it's either anti-wraparound or normal.
But autovacuum _can_ be aggressive and not anti-wraparound.
I build current master and can see 3 different line types:
2018-09-24 23:47:31.500 MSK 27939 @ from  [vxid:4/272032 txid:0] [] LOG:  automatic aggressive vacuum of table
"postgres.public.foo":index scans: 0
 
2018-09-24 23:49:27.892 MSK 28333 @ from  [vxid:4/284297 txid:0] [] LOG:  automatic aggressive vacuum to prevent
wraparoundof table "postgres.public.foo": index scans: 0
 
2018-09-24 23:49:29.093 MSK 28337 @ from  [vxid:4/284412 txid:0] [] LOG:  automatic vacuum of table
"postgres.public.foo":index scans: 0
 

regards, Sergei


Re: pgsql: Improve autovacuum logging for aggressive andanti-wraparound ru

From
Alvaro Herrera
Date:
On 2018-Sep-24, Sergei Kornilov wrote:

> Hi
> 
> > An autovacuum can't be just aggressive; it's either anti-wraparound or normal.
> But autovacuum _can_ be aggressive and not anti-wraparound.
> I build current master and can see 3 different line types:
> 2018-09-24 23:47:31.500 MSK 27939 @ from  [vxid:4/272032 txid:0] [] LOG:  automatic aggressive vacuum of table
"postgres.public.foo":index scans: 0
 
> 2018-09-24 23:49:27.892 MSK 28333 @ from  [vxid:4/284297 txid:0] [] LOG:  automatic aggressive vacuum to prevent
wraparoundof table "postgres.public.foo": index scans: 0
 
> 2018-09-24 23:49:29.093 MSK 28337 @ from  [vxid:4/284412 txid:0] [] LOG:  automatic vacuum of table
"postgres.public.foo":index scans: 0
 

Exactly.

It cannot be anti-wraparound and not aggressive, which is the line type
not shown.

"Aggressive" means it scans all pages; "anti-wraparound" means it does
not let itself be cancelled because of another process waiting for a
lock on the table.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: pgsql: Improve autovacuum logging for aggressive andanti-wraparound ru

From
Masahiko Sawada
Date:
On Tue, Sep 25, 2018 at 6:11 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> On 2018-Sep-24, Sergei Kornilov wrote:
>
> > Hi
> >
> > > An autovacuum can't be just aggressive; it's either anti-wraparound or normal.
> > But autovacuum _can_ be aggressive and not anti-wraparound.
> > I build current master and can see 3 different line types:
> > 2018-09-24 23:47:31.500 MSK 27939 @ from  [vxid:4/272032 txid:0] [] LOG:  automatic aggressive vacuum of table
"postgres.public.foo":index scans: 0
 
> > 2018-09-24 23:49:27.892 MSK 28333 @ from  [vxid:4/284297 txid:0] [] LOG:  automatic aggressive vacuum to prevent
wraparoundof table "postgres.public.foo": index scans: 0
 
> > 2018-09-24 23:49:29.093 MSK 28337 @ from  [vxid:4/284412 txid:0] [] LOG:  automatic vacuum of table
"postgres.public.foo":index scans: 0
 
>
> Exactly.
>
> It cannot be anti-wraparound and not aggressive, which is the line type
> not shown.
>
> "Aggressive" means it scans all pages; "anti-wraparound" means it does
> not let itself be cancelled because of another process waiting for a
> lock on the table.
>

I agree. Can we fix this simply by the attached patch?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment

Re: pgsql: Improve autovacuum logging for aggressive andanti-wraparound ru

From
Michael Paquier
Date:
On Fri, Sep 28, 2018 at 01:53:14PM +0900, Masahiko Sawada wrote:
> I agree. Can we fix this simply by the attached patch?

Thanks for sending a patch.

+    /* autovacuum cannot be anti-wraparound and not aggressive vacuum */
+    Assert(aggressive);
+    msgfmt = _("automatic aggressive vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");

While adding this comment in lazy_vacuum_rel() is adapted, I think that
we ought to make the assertion check more aggressive by not having it
depend on if log_min_duration is set or not.  So why not moving that to
a place a bit higher, where aggressive gets defined?
--
Michael

Attachment

Re: pgsql: Improve autovacuum logging for aggressive andanti-wraparound ru

From
Masahiko Sawada
Date:
On Tue, Oct 2, 2018 at 9:11 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Fri, Sep 28, 2018 at 01:53:14PM +0900, Masahiko Sawada wrote:
> > I agree. Can we fix this simply by the attached patch?
>
> Thanks for sending a patch.
>
> +    /* autovacuum cannot be anti-wraparound and not aggressive vacuum */
> +    Assert(aggressive);
> +    msgfmt = _("automatic aggressive vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
>
> While adding this comment in lazy_vacuum_rel() is adapted, I think that
> we ought to make the assertion check more aggressive by not having it
> depend on if log_min_duration is set or not. So why not moving that to
> a place a bit higher, where aggressive gets defined?

Since there is no place where checks params->is_wraparound we will
have to do something like;

if (params->is_wraparound)
    Assert(aggressive);

Or you meant the following?

Assert(params->is_wraparound ? aggressive : true);

I'm not sure both styles would be appropriate style in the postgres
code so I would rather add elog(ERROR) instead. Thought?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: pgsql: Improve autovacuum logging for aggressive andanti-wraparound ru

From
Michael Paquier
Date:
On Tue, Oct 02, 2018 at 01:18:01PM +0900, Masahiko Sawada wrote:
> I'm not sure both styles would be appropriate style in the postgres
> code so I would rather add elog(ERROR) instead. Thought?

My brain is rather fried for the rest of the day...  But we could just
be looking at using USE_ASSERT_CHECKING.  Thoughts from other are
welcome.
--
Michael

Attachment
Michael Paquier <michael@paquier.xyz> writes:
> My brain is rather fried for the rest of the day...  But we could just
> be looking at using USE_ASSERT_CHECKING.  Thoughts from other are
> welcome.

I'd go with folding the condition into a plain Assert.  Then it's
obvious that no code is added in a non-assert build.  I can see that
some cases might be so complicated that that isn't nice, but this
case doesn't seem to qualify.

            regards, tom lane


Re: pgsql: Improve autovacuum logging for aggressive andanti-wraparound ru

From
Masahiko Sawada
Date:
On Tue, Oct 2, 2018 at 11:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Michael Paquier <michael@paquier.xyz> writes:
> > My brain is rather fried for the rest of the day...  But we could just
> > be looking at using USE_ASSERT_CHECKING.  Thoughts from other are
> > welcome.
>
> I'd go with folding the condition into a plain Assert.  Then it's
> obvious that no code is added in a non-assert build.  I can see that
> some cases might be so complicated that that isn't nice, but this
> case doesn't seem to qualify.
>

Thank you for the comment. Attached the updated patch.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment

Re: pgsql: Improve autovacuum logging for aggressive andanti-wraparound ru

From
Michael Paquier
Date:
On Wed, Oct 03, 2018 at 04:37:29PM +0900, Masahiko Sawada wrote:
> Thank you for the comment. Attached the updated patch.

Thanks for the patch.  This looks clean to me at quick glance, not
tested though..
--
Michael

Attachment

Re: pgsql: Improve autovacuum logging for aggressive andanti-wraparound ru

From
Michael Paquier
Date:
On Wed, Oct 03, 2018 at 04:37:29PM +0900, Masahiko Sawada wrote:
> Thank you for the comment. Attached the updated patch.

So, I have come back to this stuff, and finished with the attached
instead, so as the assertion is in a single place.  I find that
clearer.  The comments have also been improved.  Thoughts?
--
Michael

Attachment

Re: pgsql: Improve autovacuum logging for aggressive andanti-wraparound ru

From
Masahiko Sawada
Date:
On Fri, Oct 5, 2018 at 12:16 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, Oct 03, 2018 at 04:37:29PM +0900, Masahiko Sawada wrote:
> > Thank you for the comment. Attached the updated patch.
>
> So, I have come back to this stuff, and finished with the attached
> instead, so as the assertion is in a single place.  I find that
> clearer.  The comments have also been improved.  Thoughts?

Thank you! The patch looks good to me.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: pgsql: Improve autovacuum logging for aggressive andanti-wraparound ru

From
Michael Paquier
Date:
On Fri, Oct 05, 2018 at 12:16:03PM +0900, Michael Paquier wrote:
> So, I have come back to this stuff, and finished with the attached
> instead, so as the assertion is in a single place.  I find that
> clearer.  The comments have also been improved.  Thoughts?

And so...  I have been looking at committing this thing, and while
testing in-depth I have been able to trigger a case where an autovacuum
has been able to be not aggressive but anti-wraparound, which is exactly
what should not be possible, no?  I have simply created an instance with
autovacuum_freeze_max_age = 200000, then ran pgbench with
autovacuum_freeze_table_age=200000 set for each table, and also ran
installcheck-world in parallel.  This has been able to trigger the
assertion pretty quickly.

Here is the stack trace:
#2  0x000055e1a12ef87b in ExceptionalCondition
 (conditionName=0x55e1a14b45c8 "!((params->is_wraparound &&
 aggressive) || !params->is_wraparound)",
     errorType=0x55e1a14b459d "FailedAssertion", fileName=0x55e1a14b4590
 "vacuumlazy.c", lineNumber=254) at assert.c:54
#3  0x000055e1a0f6c6ad in lazy_vacuum_rel (onerel=0x7f163e7ea710,
options=65, params=0x55e1a2eeba70, bstrategy=0x55e1a2f5c1a0) at
vacuumlazy.c:253
#4  0x000055e1a0f6c217 in vacuum_rel (relid=1260,
relation=0x55e1a2f5d748, options=65, params=0x55e1a2eeba70) at
vacuum.c:1714
#5  0x000055e1a0f6a3ac in vacuum (options=65, relations=0x55e1a2f2f050,
params=0x55e1a2eeba70, bstrategy=0x55e1a2f5c1a0, isTopLevel=true) at
vacuum.c:340
#6  0x000055e1a10c1ddd in autovacuum_do_vac_analyze
(tab=0x55e1a2eeba68, bstrategy=0x55e1a2f5c1a0) at autovacuum.c:3121
#7  0x000055e1a10c0e19 in do_autovacuum () at autovacuum.c:2476

$2 = {spcNode = 1664, dbNode = 0, relNode = 1260}
(gdb) p onerel->rd_node.relNode
$3 = 1260
(gdb) p params->is_wraparound
$4 = true
(gdb) p aggressive
$5 = false

I still have the core file, the binaries and the data folder, so it's
not a problem to dig into it.
--
Michael

Attachment

Re: pgsql: Improve autovacuum logging for aggressive andanti-wraparound ru

From
Kyotaro HORIGUCHI
Date:
At Fri, 5 Oct 2018 15:35:04 +0900, Michael Paquier <michael@paquier.xyz> wrote in <20181005063504.GB14664@paquier.xyz>
> On Fri, Oct 05, 2018 at 12:16:03PM +0900, Michael Paquier wrote:
> > So, I have come back to this stuff, and finished with the attached
> > instead, so as the assertion is in a single place.  I find that
> > clearer.  The comments have also been improved.  Thoughts?
> 
> And so...  I have been looking at committing this thing, and while
> testing in-depth I have been able to trigger a case where an autovacuum
> has been able to be not aggressive but anti-wraparound, which is exactly
> what should not be possible, no?  I have simply created an instance with
> autovacuum_freeze_max_age = 200000, then ran pgbench with
> autovacuum_freeze_table_age=200000 set for each table, and also ran
> installcheck-world in parallel.  This has been able to trigger the
> assertion pretty quickly.

I investigated it and in short, it can happen.

It is a kind of race consdition between two autovacuum
processes. do_autovacuum() looks into pg_class (using a snapshot)
and vacuum_set_xid_limits() looks into relcache. If concurrent
vacuum happens and one has finished the relation, another gets
relcache invalidation and relfrozenxid is updated. If this
happens between do_autovacuum() and vacuum_set_xid_limits(), the
latter sees newer relfrozenxid than the former. The problem
happens when it moves by more than 5% of
autovacuum_freeze_max_age.

If lazy_vacuum_rel() sees the situation, the relation is already
aggressively vacuumed by a cocurrent worker. We can just ingore
the state safely but also we know that the vacuum is useless.

1. Just allow the case there (and add comment).
   Causes redundant anti-wraparound vacuum.

2. Skip the relation by the condition.

   I think that we can safely skip the relation in the
   case. (attached)

3. Ensure that do_autovacuum always sees the same relfrozenxid
   with vacuum_set_xid_limits().

4. Prevent concurrent acuuming of the same relation rigorously,
  somehow.

Thoughts?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 8996d366e9..436d573454 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -249,6 +249,21 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
     if (options & VACOPT_DISABLE_PAGE_SKIPPING)
         aggressive = true;
 
+    /*
+     * When is_wraparound, relfrozenxid is old enough and aggressive must be
+     * true here. However, if another concurrent vacuum process has processed
+     * this relation, relfrozenxid in relcache can be moved forward enough so
+     * that aggressive is calculated as false here. This relation no longer
+     * needs to be vacuumed. Bail out.
+     */
+    if (params->is_wraparound && !aggressive)
+    {
+        elog(DEBUG1, "relation %d has been vacuumd ocncurrently, skip",
+            onerel->rd_id);
+        return;
+    }
+
+
     vacrelstats = (LVRelStats *) palloc0(sizeof(LVRelStats));
 
     vacrelstats->old_rel_pages = onerel->rd_rel->relpages;

Re: pgsql: Improve autovacuum logging for aggressive andanti-wraparound ru

From
Andrew Dunstan
Date:
On 10/9/18 5:15 AM, Kyotaro HORIGUCHI wrote:
> At Fri, 5 Oct 2018 15:35:04 +0900, Michael Paquier <michael@paquier.xyz> wrote in
<20181005063504.GB14664@paquier.xyz>
>> On Fri, Oct 05, 2018 at 12:16:03PM +0900, Michael Paquier wrote:
>>> So, I have come back to this stuff, and finished with the attached
>>> instead, so as the assertion is in a single place.  I find that
>>> clearer.  The comments have also been improved.  Thoughts?
>> And so...  I have been looking at committing this thing, and while
>> testing in-depth I have been able to trigger a case where an autovacuum
>> has been able to be not aggressive but anti-wraparound, which is exactly
>> what should not be possible, no?  I have simply created an instance with
>> autovacuum_freeze_max_age = 200000, then ran pgbench with
>> autovacuum_freeze_table_age=200000 set for each table, and also ran
>> installcheck-world in parallel.  This has been able to trigger the
>> assertion pretty quickly.
> I investigated it and in short, it can happen.
>
> It is a kind of race consdition between two autovacuum
> processes. do_autovacuum() looks into pg_class (using a snapshot)
> and vacuum_set_xid_limits() looks into relcache. If concurrent
> vacuum happens and one has finished the relation, another gets
> relcache invalidation and relfrozenxid is updated. If this
> happens between do_autovacuum() and vacuum_set_xid_limits(), the
> latter sees newer relfrozenxid than the former. The problem
> happens when it moves by more than 5% of
> autovacuum_freeze_max_age.
>
> If lazy_vacuum_rel() sees the situation, the relation is already
> aggressively vacuumed by a cocurrent worker. We can just ingore
> the state safely but also we know that the vacuum is useless.
>
> 1. Just allow the case there (and add comment).
>    Causes redundant anti-wraparound vacuum.
>
> 2. Skip the relation by the condition.
>
>    I think that we can safely skip the relation in the
>    case. (attached)
>
> 3. Ensure that do_autovacuum always sees the same relfrozenxid
>    with vacuum_set_xid_limits().
>
> 4. Prevent concurrent acuuming of the same relation rigorously,
>   somehow.
>
> Thoughts?
>


I notice that this seems never to have been acted on. I think we should
apply this and remove the (confusing) message setting for the case we'll
now be avoiding. If not we should at least comment there that this is a
case we only expect to see in pathological cases.


cheers


andrew



-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pgsql: Improve autovacuum logging for aggressive andanti-wraparound ru

From
Michael Paquier
Date:
On Fri, Mar 08, 2019 at 05:05:53PM -0500, Andrew Dunstan wrote:
> I notice that this seems never to have been acted on. I think we should
> apply this and remove the (confusing) message setting for the case we'll
> now be avoiding. If not we should at least comment there that this is a
> case we only expect to see in pathological cases.

Sorry for dropping the ball, I would have assumed that Robert would
handle it as he is at the origin of the introduction of the aggressive
option via fd31cd26.

+    elog(DEBUG1, "relation %d has been vacuumd ocncurrently, skip",
The proposed patch has two typos in two words.

I am adding an open item about that.  I think I could commit the
patch, but I need to study it a bit more first.
--
Michael

Attachment

Re: pgsql: Improve autovacuum logging for aggressive andanti-wraparound ru

From
Michael Paquier
Date:
On Sat, Mar 09, 2019 at 10:15:37AM +0900, Michael Paquier wrote:
> I am adding an open item about that.  I think I could commit the
> patch, but I need to study it a bit more first.

So, coming back to this thread, and studying the problem again, it
looks that the diagnostic that a non-aggressive, anti-wraparound
vacuum could be triggered because the worker sees trouble in the
force because of some activity happening in parallel.  Hence, if we
face this case, it looks right to skip the vacuum for this relation.

Attached is an updated patch with a better error message, more
comments, and the removal of the anti-wraparound non-aggressive log
which was added in 28a8fa9.  The error message could be better I
guess.  Suggestions are welcome.

Thoughts?
--
Michael

Attachment

Re: pgsql: Improve autovacuum logging for aggressive andanti-wraparound ru

From
Andrew Dunstan
Date:
On 3/29/19 7:51 AM, Michael Paquier wrote:
> On Sat, Mar 09, 2019 at 10:15:37AM +0900, Michael Paquier wrote:
>> I am adding an open item about that.  I think I could commit the
>> patch, but I need to study it a bit more first.
> So, coming back to this thread, and studying the problem again, it
> looks that the diagnostic that a non-aggressive, anti-wraparound
> vacuum could be triggered because the worker sees trouble in the
> force because of some activity happening in parallel.  Hence, if we
> face this case, it looks right to skip the vacuum for this relation.
>
> Attached is an updated patch with a better error message, more
> comments, and the removal of the anti-wraparound non-aggressive log
> which was added in 28a8fa9.  The error message could be better I
> guess.  Suggestions are welcome.
>
> Thoughts?


+                (errmsg_internal("found vacuum to prevent wraparound of
table \"%s.%s.%s\" to be not aggressive, so skipping",

This might convey something to hackers, but I doubt it will convey much
to regular users. Perhaps something like "skipping redundant
anti-wraparound vacuum of table ..." would be better.


The comment is also a bit too tentative. Perhaps something like this
would do:


    Normally the relfrozenxid for an anti-wraparound vacuum will be old
    enough to force an aggressive vacuum. However, a concurrent vacuum
    might have already done this work that the relfroxzenxid in relcache
    has been updated. If that happens this vacuum is redundant, so skip it.


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pgsql: Improve autovacuum logging for aggressive andanti-wraparound ru

From
Michael Paquier
Date:
On Fri, Mar 29, 2019 at 09:11:47AM -0400, Andrew Dunstan wrote:
> +                (errmsg_internal("found vacuum to prevent wraparound of
> table \"%s.%s.%s\" to be not aggressive, so skipping",
>
> This might convey something to hackers, but I doubt it will convey much
> to regular users. Perhaps something like "skipping redundant
> anti-wraparound vacuum of table ..." would be better.

"skipping redundant" is much better.

> The comment is also a bit too tentative. Perhaps something like this
> would do:
>
>     Normally the relfrozenxid for an anti-wraparound vacuum will be old
>     enough to force an aggressive vacuum. However, a concurrent vacuum
>     might have already done this work that the relfroxzenxid in relcache
>     has been updated. If that happens this vacuum is redundant, so skip it.

That works for me.
--
Michael

Attachment

Re: pgsql: Improve autovacuum logging for aggressive andanti-wraparound ru

From
Alvaro Herrera
Date:
On 2019-Mar-29, Michael Paquier wrote:

> On Fri, Mar 29, 2019 at 09:11:47AM -0400, Andrew Dunstan wrote:
> > +                (errmsg_internal("found vacuum to prevent wraparound of
> > table \"%s.%s.%s\" to be not aggressive, so skipping",
> > 
> > This might convey something to hackers, but I doubt it will convey much
> > to regular users. Perhaps something like "skipping redundant
> > anti-wraparound vacuum of table ..." would be better.
> 
> "skipping redundant" is much better.

Yeah, that looks good to me too.  I wonder if we really need it as LOG
though; we don't say anything for actions unless they take more than the
min duration, so why say something for a no-op that takes almost no time?
Maybe make it DEBUG1.

> > The comment is also a bit too tentative. Perhaps something like this
> > would do:
> > 
> >     Normally the relfrozenxid for an anti-wraparound vacuum will be old
> >     enough to force an aggressive vacuum. However, a concurrent vacuum
> >     might have already done this work that the relfroxzenxid in relcache
> >     has been updated. If that happens this vacuum is redundant, so skip it.
> 
> That works for me.

s/relfroxzenxid/relfrozenxid/

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pgsql: Improve autovacuum logging for aggressive andanti-wraparound ru

From
Michael Paquier
Date:
On Fri, Mar 29, 2019 at 11:22:55AM -0300, Alvaro Herrera wrote:
> Yeah, that looks good to me too.  I wonder if we really need it as LOG
> though; we don't say anything for actions unless they take more than the
> min duration, so why say something for a no-op that takes almost no time?
> Maybe make it DEBUG1.

I think that this does not justify a WARNING, as that's harmless for
the user even if we use WARNING for other skips (see
vacuum_is_relation_owner).  However DEBUG1 is also too low in my
opinion as this log can be used as an indicator that autovacuum is too
much aggressive because there are too many workers for example.  I
have seen that matter in some CPU-bound environments.  I won't fight
hard if the consensus is to use DEBUG1 though.  So, more opinions?
Andrew perhaps?

> s/relfroxzenxid/relfrozenxid/

Sure.
--
Michael

Attachment

Re: pgsql: Improve autovacuum logging for aggressive andanti-wraparound ru

From
Andrew Dunstan
Date:
On 3/29/19 9:08 PM, Michael Paquier wrote:
> On Fri, Mar 29, 2019 at 11:22:55AM -0300, Alvaro Herrera wrote:
>> Yeah, that looks good to me too.  I wonder if we really need it as LOG
>> though; we don't say anything for actions unless they take more than the
>> min duration, so why say something for a no-op that takes almost no time?
>> Maybe make it DEBUG1.
> I think that this does not justify a WARNING, as that's harmless for
> the user even if we use WARNING for other skips (see
> vacuum_is_relation_owner).  However DEBUG1 is also too low in my
> opinion as this log can be used as an indicator that autovacuum is too
> much aggressive because there are too many workers for example.  I
> have seen that matter in some CPU-bound environments.  I won't fight
> hard if the consensus is to use DEBUG1 though.  So, more opinions?
> Andrew perhaps?
>
>


It's really just a matter of housekeeping as I see it, so probably
DEBUG1 is right.


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pgsql: Improve autovacuum logging for aggressive andanti-wraparound ru

From
Michael Paquier
Date:
On Sat, Mar 30, 2019 at 08:16:21AM -0400, Andrew Dunstan wrote:
> It's really just a matter of housekeeping as I see it, so probably
> DEBUG1 is right.

Okay, I'll use that then.
--
Michael

Attachment

Re: pgsql: Improve autovacuum logging for aggressive andanti-wraparound ru

From
Michael Paquier
Date:
On Sat, Mar 30, 2019 at 10:12:33PM +0900, Michael Paquier wrote:
> Okay, I'll use that then.

And finally committed.  I have changed the debug1 message so as "to
prevent wraparound" is used instead of "anti-wraparound".  I have
noticed something which was also missing from all the patches proposed
on this thread: a reset of the progress state was not done, and we
need to call pgstat_progress_end_command() in order to do that.
--
Michael

Attachment

Re: pgsql: Improve autovacuum logging for aggressive andanti-wraparound ru

From
Andres Freund
Date:
Hi,

On 2019-03-29 20:51:38 +0900, Michael Paquier wrote:
> So, coming back to this thread, and studying the problem again, it
> looks that the diagnostic that a non-aggressive, anti-wraparound
> vacuum could be triggered because the worker sees trouble in the
> force because of some activity happening in parallel.  Hence, if we
> face this case, it looks right to skip the vacuum for this relation.
> 
> Attached is an updated patch with a better error message, more
> comments, and the removal of the anti-wraparound non-aggressive log
> which was added in 28a8fa9.  The error message could be better I
> guess.  Suggestions are welcome.

> diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
> index 5c554f9465..82be8c81f3 100644
> --- a/src/backend/access/heap/vacuumlazy.c
> +++ b/src/backend/access/heap/vacuumlazy.c
> @@ -248,6 +248,25 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
>      if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
>          aggressive = true;
>  
> +    /*
> +     * When running an anti-wraparound vacuum, we expect relfrozenxid to be
> +     * old enough so as aggressive is always set.  If this is not the case,
> +     * it could be possible that another concurrent vacuum process has done
> +     * the work for this relation so that relfrozenxid in relcache has
> +     * already been moved forward enough, causing this vacuum run to be
> +     * non-aggressive.  If that happens, note that this relation no longer
> +     * needs to be vacuumed, so just skip it.
> +     */
> +    if (params->is_wraparound && !aggressive)
> +    {
> +        ereport(LOG,
> +                (errmsg_internal("found vacuum to prevent wraparound of table \"%s.%s.%s\" to be not aggressive, so
skipping",
> +                                 get_database_name(MyDatabaseId),
> +                                 get_namespace_name(RelationGetNamespace(onerel)),
> +                                 RelationGetRelationName(onerel))));
> +        return;
> +    }
> +

Which valid scenario can lead to this? Neither the comment, nor commit
message explain it. Unless you're thinking of scenarios where autovacuum
and manual vacuum are mixed, I don't really see valid reasons? Normally
autovacuum's locking + the table_recheck_autovac() check should prevent
problematic scenarios.

I do see a few scenarios that can trigger this - but they all more or
less are bugs.

It doesn't strike me as a good idea to work around such bugs by silently
neutering heap_vacuum_rel(). The likelihood of that temporarily covering
up more severe problems seems significant - they're likely to then later
bite you with a cluster shutdown.

Greetings,

Andres Freund



Re: pgsql: Improve autovacuum logging for aggressive andanti-wraparound ru

From
Michael Paquier
Date:
On Mon, Mar 23, 2020 at 06:41:50PM -0700, Andres Freund wrote:
> Which valid scenario can lead to this? Neither the comment, nor commit
> message explain it.

The commit message mentions that concurrent autovacuum jobs can lead
to the creation of non-aggressive and anti-wraparound jobs, which have
no sense because an aggressive and anti-wraparound job was already
done in parallel with a different worker, and that this was possible
because of inconsistent relcache lookups across concurrent jobs.  This
was mentioned upthread.

> Unless you're thinking of scenarios where autovacuum
> and manual vacuum are mixed, I don't really see valid reasons? Normally
> autovacuum's locking + the table_recheck_autovac() check should prevent
> problematic scenarios.
>
> I do see a few scenarios that can trigger this - but they all more or
> less are bugs.

Hmm.  OK.

> It doesn't strike me as a good idea to work around such bugs by silently
> neutering heap_vacuum_rel(). The likelihood of that temporarily covering
> up more severe problems seems significant - they're likely to then later
> bite you with a cluster shutdown.

Saying that, I have been thinking about this one for a couple of days
now and it seems to me that this is a factor contributing to what we
are seeing in [1], and I agree that this is just an incorrect approach
that makes easier to trigger the real underlying issues, while
table_recheck_autovac() ought to be the only code path doing the skip
job.  Note that I have failed to reproduce the behavior of the other
thread though, always finishing with a non-aggressive anti-wraparound
skipped because of an aggressive and anti-wraparound job happened just
before in parallel, and autovacuum was always able to continue
triggering new jobs, keeping the relfrozenxid age at bay.

So I would like to first revert that part, to have a cleaner state to
work on the underlying issues.  A pure revert means also adding back
the log message for non-aggressive and anti-wraparound jobs that
should never exist, which should be replaced by an assertion once all
the holes are fixed.  What do you think?

[1]: https://www.postgresql.org/message-id/CAE39h23RTX1jkYjWc5tccv34HwwraizaCUxOmdQdPM+Zt5-2Qg@mail.gmail.com
--
Michael

Attachment