Re: BUG #16614: Stale temporary objects makes vacuum ineffective when 1 million transactions remain - Mailing list pgsql-bugs

From Giorgio Saviane
Subject Re: BUG #16614: Stale temporary objects makes vacuum ineffective when 1 million transactions remain
Date
Msg-id CAHs6c0fTSor3WHd_CSsGThtOSAgO2PAcCT+M+uF7hcgQsQvYtQ@mail.gmail.com
Whole thread Raw
In response to Re: BUG #16614: Stale temporary objects makes vacuum ineffective when 1 million transactions remain  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: BUG #16614: Stale temporary objects makes vacuum ineffective when 1 million transactions remain  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
Ok, let me add one more step in that sequence.
Before shutting down Postgres and going down to single mode we stopped the application server and other services around it. I'm pretty sure that there was a reasonable window of time before we went to single mode. Moreover, if the database already reached the stop limit for preventing the wraparound, was it still possible for autovacuum to kick in in multiuser mode?
Wasn't the graceful database shutdown enough to let Postgres drop the temporary tables?

Kind regards

Giorgio Saviane

Il giorno sab 12 set 2020 alle ore 17:49 Tom Lane <tgl@sss.pgh.pa.us> ha scritto:
Giorgio Saviane <gsaviane@gmail.com> writes:
> Let me clarify.
> The sequence of events that led our databases to the wraparound issue is
> the following:

>    1. The application creates temporary tables that should be by default
>    cleaned at the end of the session. As a matter of fact the pooler doesn't
>    actually close the session, neither - apparently - issues a DISCARD ALL,
>    creating leaks among sessions.
>    2. Since sessions are kept alive with stale temporary tables the
>    autovacuum process cannot do maintenance on transaction ids.
>    3. Since we have no evidence on what is going on, the database reaches
>    the 1 million stop limit to avoid transaction wraparound.
>    4. The application gets then out of order. We stop the Postgres server
>    gracefully and we enter single user mode to run the vacuum
>    5. Any attempt to vacuum the database ends up with the usual "WARNING:
>    database "mydb" must be vacuumed within XXX transactions" message,
>    revealing that the vacuum could not fix up the transaction ids.

So, actually, that was the wrong recovery method.  All you really
needed to do was terminate that long-running session(s) and wait for
a minute or two.  There is code to delete old orphaned temp tables,
even as far back as 9.5.  But *it's part of autovacuum*.  Dropping
down to single-user mode prevents it from running.

I thought briefly about whether VACUUM in single-user mode should
also run that code; but I'm not really convinced it'd be a good thing
to be auto-dropping tables in single-user mode.  And anyway we shouldn't
be encouraging use of single-user mode, and it'd be much better to not
be in this situation to start with.

What seems like the most helpful thing we could do is teach autovacuum to
issue log warnings when it sees old temp tables holding back datfrozenxid.
The trick here is to not spam the log too much.  After a few moment's
thought, maybe warn if:

* a table is temp but not orphaned, and
* its relfrozenxid or relminmxid is equal to datfrozenxid or datminmxid,
proving that some previous autovac run saw it as holding back those
horizons, and
* those xids are older than the relevant freeze limit.

For extra credit maybe the message could mention the PID of the
session owning the table?

                        regards, tom lane

pgsql-bugs by date:

Previous
From: Tom Lane
Date:
Subject: Re: BUG #16614: Stale temporary objects makes vacuum ineffective when 1 million transactions remain
Next
From: Tom Lane
Date:
Subject: Re: BUG #16614: Stale temporary objects makes vacuum ineffective when 1 million transactions remain