Giorgio Saviane <gsaviane@gmail.com> writes:
> Let me clarify.
> The sequence of events that led our databases to the wraparound issue is
> the following:
> 1. The application creates temporary tables that should be by default
> cleaned at the end of the session. As a matter of fact the pooler doesn't
> actually close the session, neither - apparently - issues a DISCARD ALL,
> creating leaks among sessions.
> 2. Since sessions are kept alive with stale temporary tables the
> autovacuum process cannot do maintenance on transaction ids.
> 3. Since we have no evidence on what is going on, the database reaches
> the 1 million stop limit to avoid transaction wraparound.
> 4. The application gets then out of order. We stop the Postgres server
> gracefully and we enter single user mode to run the vacuum
> 5. Any attempt to vacuum the database ends up with the usual "WARNING:
> database "mydb" must be vacuumed within XXX transactions" message,
> revealing that the vacuum could not fix up the transaction ids.
So, actually, that was the wrong recovery method. All you really
needed to do was terminate that long-running session(s) and wait for
a minute or two. There is code to delete old orphaned temp tables,
even as far back as 9.5. But *it's part of autovacuum*. Dropping
down to single-user mode prevents it from running.
I thought briefly about whether VACUUM in single-user mode should
also run that code; but I'm not really convinced it'd be a good thing
to be auto-dropping tables in single-user mode. And anyway we shouldn't
be encouraging use of single-user mode, and it'd be much better to not
be in this situation to start with.
What seems like the most helpful thing we could do is teach autovacuum to
issue log warnings when it sees old temp tables holding back datfrozenxid.
The trick here is to not spam the log too much. After a few moment's
thought, maybe warn if:
* a table is temp but not orphaned, and
* its relfrozenxid or relminmxid is equal to datfrozenxid or datminmxid,
proving that some previous autovac run saw it as holding back those
horizons, and
* those xids are older than the relevant freeze limit.
For extra credit maybe the message could mention the PID of the
session owning the table?
regards, tom lane