Thread: BUG #16614: Stale temporary objects makes vacuum ineffective when 1 million transactions remain
BUG #16614: Stale temporary objects makes vacuum ineffective when 1 million transactions remain
From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference: 16614
Logged by: Giorgio Saviane
Email address: gsaviane@gmail.com
PostgreSQL version: 9.5.23
Operating system: Linux

Description:

We noticed the same issue described below occurring multiple times in some of our many Postgres 9.5 deployments.

The application makes extensive use of temporary tables. Although there are scheduled activities for regularly vacuuming the database, after some months of uptime the database gets into the 1 million transactions limit. Even though we perform manual maintenance by vacuuming in single-user mode, the server keeps returning the same warning and hint:

WARNING: database "mydb" must be vacuumed within (X<1000000) transactions
HINT: To avoid a database shutdown, execute a database-wide VACUUM in "mydb".

Any further vacuum command issued completes, but the warning keeps showing up and the remaining transactions are decremented. Only removing all the stale temporary objects with

DO $$
DECLARE
  rec RECORD;
BEGIN
  FOR rec IN
    SELECT 'DROP SCHEMA ' || nspname || ' CASCADE' AS stmt
    FROM pg_catalog.pg_namespace
    WHERE nspname LIKE 'pg_temp%' OR nspname LIKE 'pg_toast_temp%'
  LOOP
    EXECUTE rec.stmt;
  END LOOP;
END;
$$;

makes the vacuum command complete without warnings. We think that stale temporary objects are somehow blocking the vacuuming of the database.
Re: BUG #16614: Stale temporary objects makes vacuum ineffective when 1 million transactions remain
From
Tom Lane
Date:
PG Bug reporting form <noreply@postgresql.org> writes:
> We noticed the same issue described below occurring multiple times in some
> of our many Postgres 9.5 deployments.
> The application makes extensive use of temporary tables. Although there are
> scheduled activities for regularly vacuuming the database, after some months
> of uptime the database gets into the 1 million transactions limit.

Couple of possibilities here ...

Autovacuum, and manual vacuum too, cannot do anything with temp tables belonging to other sessions. (It's physically impossible, because the live data for a temp table is likely to only exist inside the memory of the owning session.) So, if you have temp tables that survive long enough to need vacuuming because of wraparound considerations, then the owning session has to do that itself.

Alternatively, maybe you don't have any temp-table-owning sessions that live that long, but for some reason some temp tables failed to get dropped when their owning session exited. (AFAIK this could only happen as part of a database crash --- is that a common occurrence for you?) They're still there and still holding back the global XID horizon, so the only way to fix things is to forcibly drop them.

PG versions later than 9.5 have added increasingly aggressive rules for automatically dropping such orphan temp tables. From memory, 9.5 does have some behavior of that sort; but it's evidently not firing for you. What you'll need to do is find the old tables (look for pg_class entries with large values of age(relfrozenxid)) and manually DROP those tables.

I would not recommend the wholesale removal of temp schemas shown in your script, because that carries too much risk of removing tables that are in active use.

regards, tom lane
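Tom's suggested check can be expressed as a catalog query; a sketch, where the age threshold is an arbitrary example rather than a recommendation:

```sql
-- Temporary tables holding back the XID horizon, oldest first.
-- relpersistence = 't' marks temporary relations; relkind = 'r'
-- restricts the result to plain tables. The 100-million-XID
-- threshold here is only an illustrative cutoff.
SELECT n.nspname, c.relname, age(c.relfrozenxid) AS xid_age
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relpersistence = 't'
  AND c.relkind = 'r'
  AND age(c.relfrozenxid) > 100000000
ORDER BY xid_age DESC;
```

Any table this query returns belongs to some session's temp schema; it can only be dropped safely once it is known that the owning session is gone.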
Re: BUG #16614: Stale temporary objects makes vacuum ineffective when 1 million transactions remain
From
Michael Paquier
Date:
On Thu, Sep 10, 2020 at 10:43:54AM -0400, Tom Lane wrote:
> Alternatively, maybe you don't have any temp-table-owning sessions
> that live that long, but for some reason some temp tables failed to
> get dropped when their owning session exited. (AFAIK this could only
> happen as part of a database crash --- is that a common occurrence
> for you?) They're still there and still holding back the global XID
> horizon, so the only way to fix things is to forcibly drop them.
>
> PG versions later than 9.5 have added increasingly aggressive rules for
> automatically dropping such orphan temp tables. From memory, 9.5 does
> have some behavior of that sort; but it's evidently not firing for you.
> What you'll need to do is find the old tables (look for pg_class entries
> with large values of age(relfrozenxid)) and manually DROP those tables.

Yeah. 11 and newer versions have been made even more aggressive with the cleanup of orphaned tables in autovacuum, particularly for the case where a backend reuses the ID of a past session that crashed, leaving behind some temporary tables. Perhaps that was the case here?

> I would not recommend the wholesale removal of temp schemas shown in
> your script, because that carries too much risk of removing tables
> that are in active use.

Dropping a temporary schema directly would put sessions still relying on it in jeopardy, as a session that has created at least one temporary table caches the namespace OID of its temp schema. The results will be surprising, and likely not what you'd expect, if those live sessions create more temp tables afterwards; still, the operation can be useful in some cases, as it's a no-brainer, and only a superuser can do it. Dropping the tables themselves from another session would be fine.
--
Michael
Re: BUG #16614: Stale temporary objects makes vacuum ineffective when 1 million transactions remain
From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> Yeah. 11 and newer versions have been made even more aggressive with
> the cleanup of orphaned tables in autovacuum, particularly the case
> where a backend reuses the ID of a past session that crashed, leaving
> behind some temporary tables. Perhaps that was the case here?

Off-list, the OP indicated that the problem is actually of the other category, ie the problematic tables belong to extremely long-lived sessions that are managed by a connection pooler. I don't quite understand why the pooler isn't issuing DISCARD ALL between clients, but anyway it seems that PG is operating as designed here.

regards, tom lane
Re: BUG #16614: Stale temporary objects makes vacuum ineffective when 1 million transactions remain
From
Michael Paquier
Date:
On Fri, Sep 11, 2020 at 10:21:21AM -0400, Tom Lane wrote:
> Off-list, the OP indicated that the problem is actually of the other
> category, ie the problematic tables belong to extremely long-lived
> sessions that are managed by a connection pooler. I don't quite
> understand why the pooler isn't issuing DISCARD ALL between clients,
> but anyway it seems that PG is operating as designed here.

Oh, OK, thanks! That was not mentioned originally so I got confused.
--
Michael
Re: BUG #16614: Stale temporary objects makes vacuum ineffective when 1 million transactions remain
From
Giorgio Saviane
Date:
The thing is that we realized this only when the wraparound stop limit was reached, which seems an inevitable cul-de-sac if your application is leaking temporary tables. Even worse, when you try to vacuum in single-user mode there is nothing that tells you "drop those bloody temp tables". It keeps telling you that less than 1 million transactions are remaining, without explaining why. We lost an entire database of one of our customers before becoming aware of the fact that vacuum is ineffective with temp tables.
The question is: can you implement some warning thrown by vacuum/autovacuum in case orphaned temporary tables are in the way?
The connection pooler we are using is the one provided by Tomcat 7. I'm gonna check if it supports DISCARD ALL on close.
Kind regards
P.S. sorry for replying here, couldn't get how to reply on the thread publicly.
Il giorno ven 11 set 2020 alle ore 16:21 Tom Lane <tgl@sss.pgh.pa.us> ha scritto:
Michael Paquier <michael@paquier.xyz> writes:
> Yeah. 11 and newer versions have been made even more aggressive with
> the cleanup of orphaned tables in autovacuum, particularly the case
> where a backend reuses the ID of a past session that crashed, leaving
> behind some temporary tables. Perhaps that was the case here?
Off-list, the OP indicated that the problem is actually of the other
category, ie the problematic tables belong to extremely long-lived
sessions that are managed by a connection pooler. I don't quite
understand why the pooler isn't issuing DISCARD ALL between clients,
but anyway it seems that PG is operating as designed here.
regards, tom lane
Re: BUG #16614: Stale temporary objects makes vacuum ineffective when 1 million transactions remain
From
Francisco Olarte
Date:
On Sat, Sep 12, 2020 at 10:59 AM Giorgio Saviane <gsaviane@gmail.com> wrote:
....
> The question is: can you implement some warning thrown by vacuum/autovacuum in case orphaned temporary tables are in the way?
> The connection pooler we are using is the one provided by Tomcat 7. I'm gonna check if it supports DISCARD ALL on close.

From what I've read, the thing is that if the pooler does not work correctly (by issuing DISCARD ALL or otherwise taking care of temporary tables), there is little the server can do, as the tables are not orphaned on the server side. What the server sees is someone who connects, creates the temporary tables, issues a lot of other commands but does not disconnect, so the tables must be kept alive in case the session wants to use them again. DISCARD ALL will make them orphans.

Francisco Olarte.
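The effect of DISCARD ALL on a session's temporary tables can be observed directly; a small psql-session sketch (the table name is illustrative):

```sql
-- Create a temp table in the current session; it appears in pg_class
-- under this session's pg_temp_N schema.
CREATE TEMP TABLE scratch (id int);
SELECT count(*) FROM pg_class
WHERE relpersistence = 't' AND relname = 'scratch';
-- expect 1 (assuming no other session has a temp table of this name)

-- DISCARD ALL resets the session state, which includes dropping all
-- temporary tables owned by the session.
DISCARD ALL;
SELECT count(*) FROM pg_class
WHERE relpersistence = 't' AND relname = 'scratch';
-- expect 0
```

This is what a pooler's reset-on-return step buys you: temp tables never outlive the logical client, even though the physical session stays open.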
Re: BUG #16614: Stale temporary objects makes vacuum ineffective when 1 million transactions remain
From
Giorgio Saviane
Date:
Let me clarify.
The sequence of events that led our databases to the wraparound issue is the following:
- The application creates temporary tables that should by default be cleaned up at the end of the session. As a matter of fact the pooler doesn't actually close the session, nor, apparently, does it issue a DISCARD ALL, leaking temporary tables across sessions.
- Since sessions are kept alive with stale temporary tables, the autovacuum process cannot do maintenance on transaction IDs.
- Since we have no visibility into what is going on, the database reaches the 1 million stop limit for avoiding transaction wraparound.
- The application then goes out of order. We stop the Postgres server gracefully and enter single-user mode to run the vacuum.
- Any attempt to vacuum the database ends with the usual "WARNING: database "mydb" must be vacuumed within XXX transactions" message, revealing that the vacuum could not fix up the transaction IDs.
- By running this SQL (still in single-user mode)

  SELECT nspname || '.' || relname AS objname
  FROM pg_class JOIN pg_namespace
    ON pg_namespace.oid = pg_class.relnamespace
  WHERE relpersistence = 't' AND relkind = 'r'

  we notice the presence of orphaned tables. Sessions are gone at this point, so they are for sure orphaned.
- If we manually drop the orphaned tables and run vacuum again, the warning message disappears and the database is restored to normal functioning.
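The manual drop step can be scripted from the same catalogs; a sketch, safe only in the single-user scenario described above, where no live session can own the temp schemas:

```sql
-- Generate one DROP statement per temporary table. format() with %I
-- quotes the schema and table names safely. Review the output before
-- executing it; in a multi-user system some of these tables may
-- belong to live sessions.
SELECT format('DROP TABLE %I.%I;', n.nspname, c.relname) AS stmt
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relpersistence = 't'
  AND c.relkind = 'r';
```

Dropping the individual tables this way follows Tom's earlier advice: it avoids the riskier wholesale removal of whole pg_temp_* schemas.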
Whether this situation is caused by application misbehavior or by a missed temporary table cleanup, I'm asking if there is a reasonable way to prevent it by improving the alerting:
- Warn when vacuum repeatedly fails to fix up transaction IDs in regular multi-user mode.
- Warn when vacuum cannot fix up transaction IDs in single-user mode because orphaned tables are in the way.
If we had these alerts we would not have lost a database.
Kind regards
Giorgio Saviane
Il giorno sab 12 set 2020 alle ore 12:59 Francisco Olarte <folarte@peoplecall.com> ha scritto:
On Sat, Sep 12, 2020 at 10:59 AM Giorgio Saviane <gsaviane@gmail.com> wrote:
....
> The question is: can you implement some warning thrown by vacuum/autovacuum in case orphaned temporary tables are in the way?
> The connection pooler we are using is the one provided by Tomcat 7. I'm gonna check if it supports DISCARD ALL on close.
From what I've read the thing is, if the pooler does not work
correctly ( by issuing discard all or otherwise taking care of
temporary tables ), there are few things as the tables are not
orphaned on the server side.
What the server sees is someone connects, creates the temporary
tables, issues a lot of other commands but does not disconnect, so
the tables must be kept alive in case the session wants to use
them again. DISCARD ALL will make them orphans.
Francisco Olarte.
Re: BUG #16614: Stale temporary objects makes vacuum ineffective when 1 million transactions remain
From
Francisco Olarte
Date:
Giorgio.

On Sat, Sep 12, 2020 at 1:57 PM Giorgio Saviane <gsaviane@gmail.com> wrote:
...
It seems I've hit a new gmail "bug-feature" and it has stripped numbering on the reply; doing it by hand, misnumberings are mine.....

> The sequence of events that led our databases to the wraparound issue is the following:
>
> 1. The application creates temporary tables that should be by default cleaned at the end of the session. As a matter of fact the pooler doesn't actually close the session, neither - apparently - issues a DISCARD ALL, creating leaks among sessions.
> 2. Since sessions are kept alive with stale temporary tables the autovacuum process cannot do maintenance on transaction ids.
> 3. Since we have no evidence on what is going on, the database reaches the 1 million stop limit to avoid transaction wraparound.
> 4. The application gets then out of order. We stop the Postgres server gracefully and we enter single user mode to run the vacuum
> 5. Any attempt to vacuum the database ends up with the usual "WARNING: database "mydb" must be vacuumed within XXX transactions" message, revealing that the vacuum could not fix up the transaction ids.
> By running this SQL (still in single user mode)
> SELECT nspname || '.' || relname AS objname
> FROM pg_class JOIN pg_namespace
> ON pg_namespace.oid = pg_class.relnamespace
> WHERE relpersistence = 't' AND relkind = 'r'
> 6. We notice the presence of orphaned tables. Sessions are gone at this point, so they are for sure orphaned.
> 7. If we manually drop the orphaned tables and run vacuum again the warning message disappears and the database is restored to normal functioning.

I see your point now. IMO postgres should either have dropped the temporary tables in 4., the graceful shutdown, or left them in a state where they could be vacuumed; I did not notice the graceful shutdown step before. Just one question: PG documents shutdown as smart, fast and immediate, which one are you using for "graceful"? (Anyway, after a server restart I would expect the tables to be vacuumable, even after a power loss followed by recovery.) They are not supposed to outlive the session, and as you say the sessions are gone.

Francisco Olarte.
Re: BUG #16614: Stale temporary objects makes vacuum ineffective when 1 million transactions remain
From
Tom Lane
Date:
Giorgio Saviane <gsaviane@gmail.com> writes:
> Let me clarify.
> The sequence of events that led our databases to the wraparound issue is
> the following:
> 1. The application creates temporary tables that should be by default
> cleaned at the end of the session. As a matter of fact the pooler doesn't
> actually close the session, neither - apparently - issues a DISCARD ALL,
> creating leaks among sessions.
> 2. Since sessions are kept alive with stale temporary tables the
> autovacuum process cannot do maintenance on transaction ids.
> 3. Since we have no evidence on what is going on, the database reaches
> the 1 million stop limit to avoid transaction wraparound.
> 4. The application gets then out of order. We stop the Postgres server
> gracefully and we enter single user mode to run the vacuum
> 5. Any attempt to vacuum the database ends up with the usual "WARNING:
> database "mydb" must be vacuumed within XXX transactions" message,
> revealing that the vacuum could not fix up the transaction ids.

So, actually, that was the wrong recovery method. All you really needed to do was terminate that long-running session (or sessions) and wait for a minute or two. There is code to delete old orphaned temp tables, even as far back as 9.5. But *it's part of autovacuum*. Dropping down to single-user mode prevents it from running.

I thought briefly about whether VACUUM in single-user mode should also run that code; but I'm not really convinced it'd be a good thing to be auto-dropping tables in single-user mode. And anyway we shouldn't be encouraging use of single-user mode, and it'd be much better not to be in this situation to start with.

What seems like the most helpful thing we could do is teach autovacuum to issue log warnings when it sees old temp tables holding back datfrozenxid. The trick here is to not spam the log too much. After a few moments' thought, maybe warn if:

* the table is temp but not orphaned, and
* its relfrozenxid or relminmxid is equal to datfrozenxid or datminmxid, proving that some previous autovac run saw it as holding back those horizons, and
* those XIDs are older than the relevant freeze limit.

For extra credit maybe the message could mention the PID of the session owning the table?

regards, tom lane
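The second condition Tom describes can be approximated today with a manual catalog query; a sketch, checking XIDs only (the analogous relminmxid/datminmxid check is omitted for brevity):

```sql
-- Temp tables pinning the database's XID horizon: relfrozenxid equal
-- to datfrozenxid means a previous vacuum cycle saw this table as the
-- oldest thing in the database.
SELECT n.nspname, c.relname, age(c.relfrozenxid) AS xid_age
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
CROSS JOIN (SELECT datfrozenxid
            FROM pg_database
            WHERE datname = current_database()) d
WHERE c.relpersistence = 't'
  AND c.relkind = 'r'
  AND c.relfrozenxid = d.datfrozenxid;
```

Run periodically from a monitoring job, a non-empty result here is exactly the early warning the OP is asking for.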
Re: BUG #16614: Stale temporary objects makes vacuum ineffective when 1 million transactions remain
From
Giorgio Saviane
Date:
Ok, let me add one more step in that sequence.
Before shutting down Postgres and going down to single mode we stopped the application server and other services around it. I'm pretty sure that there was a reasonable window of time before we went to single mode. Moreover, if the database already reached the stop limit for preventing the wraparound, was it still possible for autovacuum to kick in in multiuser mode?
Wasn't the graceful database shutdown enough to let Postgres drop the temporary tables?
Kind regards
Giorgio Saviane
Il giorno sab 12 set 2020 alle ore 17:49 Tom Lane <tgl@sss.pgh.pa.us> ha scritto:
Giorgio Saviane <gsaviane@gmail.com> writes:
> Let me clarify.
> The sequence of events that led our databases to the wraparound issue is
> the following:
> 1. The application creates temporary tables that should be by default
> cleaned at the end of the session. As a matter of fact the pooler doesn't
> actually close the session, neither - apparently - issues a DISCARD ALL,
> creating leaks among sessions.
> 2. Since sessions are kept alive with stale temporary tables the
> autovacuum process cannot do maintenance on transaction ids.
> 3. Since we have no evidence on what is going on, the database reaches
> the 1 million stop limit to avoid transaction wraparound.
> 4. The application gets then out of order. We stop the Postgres server
> gracefully and we enter single user mode to run the vacuum
> 5. Any attempt to vacuum the database ends up with the usual "WARNING:
> database "mydb" must be vacuumed within XXX transactions" message,
> revealing that the vacuum could not fix up the transaction ids.
So, actually, that was the wrong recovery method. All you really
needed to do was terminate that long-running session(s) and wait for
a minute or two. There is code to delete old orphaned temp tables,
even as far back as 9.5. But *it's part of autovacuum*. Dropping
down to single-user mode prevents it from running.
I thought briefly about whether VACUUM in single-user mode should
also run that code; but I'm not really convinced it'd be a good thing
to be auto-dropping tables in single-user mode. And anyway we shouldn't
be encouraging use of single-user mode, and it'd be much better to not
be in this situation to start with.
What seems like the most helpful thing we could do is teach autovacuum to
issue log warnings when it sees old temp tables holding back datfrozenxid.
The trick here is to not spam the log too much. After a few moment's
thought, maybe warn if:
* a table is temp but not orphaned, and
* its relfrozenxid or relminmxid is equal to datfrozenxid or datminmxid,
proving that some previous autovac run saw it as holding back those
horizons, and
* those xids are older than the relevant freeze limit.
For extra credit maybe the message could mention the PID of the
session owning the table?
regards, tom lane
Re: BUG #16614: Stale temporary objects makes vacuum ineffective when 1 million transactions remain
From
Tom Lane
Date:
Giorgio Saviane <gsaviane@gmail.com> writes:
> Ok, let me add one more step in that sequence.
> Before shutting down Postgres and going down to single mode we stopped the
> application server and other services around it. I'm pretty sure that there
> was a reasonable window of time before we went to single mode.

Hmm, but did you stop the connection pooler? That's what was holding the database session open, if I'm understanding things correctly. Unless you've changed the default autovacuum settings (particularly autovacuum_naptime), I'd have expected autovacuum to start zapping temp tables within a minute or so after they become orphaned.

Even more to the point, if the database session was allowed to stop normally, it should've cleaned them up itself.

regards, tom lane
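To terminate the offending sessions rather than drop to single-user mode, a temp schema can be mapped back to its owning backend: the N in pg_temp_N is the backend ID, which pg_stat_get_backend_pid() translates into a PID. A sketch; verify which session you are looking at before terminating anything:

```sql
-- Map each active backend ID to its PID; the schema pg_temp_<N>
-- belongs to the backend with ID N. Compare temp_schema against the
-- pg_temp_* schemas that show up in pg_namespace.
SELECT id AS backend_id,
       pg_stat_get_backend_pid(id) AS pid,
       'pg_temp_' || id AS temp_schema
FROM pg_stat_get_backend_idset() AS id;

-- Once a PID is confirmed to be the stale pooled session:
-- SELECT pg_terminate_backend(<pid>);
```

After the session is gone, autovacuum (left running in normal multi-user mode) can drop the now-orphaned temp tables on its own, as Tom describes.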
Re: BUG #16614: Stale temporary objects makes vacuum ineffective when 1 million transactions remain
From
Giorgio Saviane
Date:
The connection pooler is a component embedded in the Tomcat container. When Tomcat stops, the connection pooler gets shut down too. We can tell by the Postgres processes that disappear from htop.
We never touch autovacuum settings, they are kept to the default.
Kind regards
Giorgio Saviane
Il giorno sab 12 set 2020 alle ore 20:18 Tom Lane <tgl@sss.pgh.pa.us> ha scritto:
Giorgio Saviane <gsaviane@gmail.com> writes:
> Ok, let me add one more step in that sequence.
> Before shutting down Postgres and going down to single mode we stopped the
> application server and other services around it. I'm pretty sure that there
> was a reasonable window of time before we went to single mode.
Hmm, but did you stop the connection pooler? That's what was holding
the database session open, if I'm understanding things correctly.
Unless you've changed the default autovacuum settings (particularly
autovacuum_naptime), I'd have expected autovacuum to start zapping
temp tables within a minute or so after they become orphaned.
Even more to the point, if the database session was allowed to stop
normally, it should've cleaned them up itself.
regards, tom lane