Thread: Cannot use a standalone backend to VACUUM in "postgres""
We have a PostgreSQL 8.2.6 installation running for about six-months now. There was a lot of log entries saying (sometimes 10 or more in just one second): WARNING: oldest xmin is far in the past HINT: Close open transactions soon to avoid wraparound problems. (actually it was in Spanish but I think that's irrelevant). The funny thing is that there was no open transactions, even after restarting the cluster the same message was logged. Today, the database stopped working as expected: ERROR: database is shut down to avoid wraparound data loss in database "postgres" HINT: Stop the postmaster and use a standalone backend to VACUUM in "postgres" So, the postmaster was stopped to follow the hint but even in stand-alone mode postgres keeps saying: WARNING: database "postgres" must be vacuumed within 999805 transactions HINT: To avoid a database shutdown, execute a full-database VACUUM in "postgres". WARNING: oldest xmin is far in the past HINT: Close open transactions soon to avoid wraparound problems. Every time vacuum is run the number decreases by one but after a few runs I still cannot access the cluster :-(. (My plan was to take a pg_dumpall and then re-init the cluster.) Attached is the output of: SELECT relname, age(relfrozenxid) FROM pg_class WHERE relkind = 'r'; SELECT datname, age(datfrozenxid) FROM pg_database; pg_controldata says: pg_control version number: 822 Catalog version number: 200611241 Database system identifier: 5040396405114363383 Database cluster state: in production pg_control last modified: Mon 07 Apr 2008 09:22:19 PM CDT Current log file ID: 33 Next log file segment: 91 Latest checkpoint location: 21/5A8EC824 Prior checkpoint location: 21/5A8C8CDC Latest checkpoint's REDO location: 21/5A8EC824 Latest checkpoint's UNDO location: 0/0 Latest checkpoint's TimeLineID: 1 Latest checkpoint's NextXID: 0/2280224912 Latest checkpoint's NextOID: 103405 Latest checkpoint's NextMultiXactId: 64513935 Latest checkpoint's NextMultiOffset: 154155767 Time of latest checkpoint: Mon 07 Apr 2008 09:20:54 PM CDT Minimum recovery ending location: 0/0 Maximum data alignment: 4 Database block size: 8192 Blocks per segment of large relation: 131072 WAL block size: 8192 Bytes per WAL segment: 16777216 Maximum length of identifiers: 64 Maximum columns in an index: 32 Date/time type storage: floating-point numbers Maximum length of locale name: 128 LC_COLLATE: es_MX.ISO-8859-1 LC_CTYPE: es_MX.ISO-8859-1 Please let me know if there is more information needed. Regards, Manuel.
Attachment
Manuel Sugawara wrote: Hi Manuel, > The funny thing is that there was no open transactions, even after > restarting the cluster the same message was logged. Today, the > database stopped working as expected: > > ERROR: database is shut down to avoid wraparound data loss in database "postgres" > HINT: Stop the postmaster and use a standalone backend to VACUUM in "postgres" I suggest you look for temp tables that have not been reclaimed. We've had a couple of reports where leftover temp tables have stopped the frozen-xid counter from advancing. (They would have a very old relfrozenxid.) -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes: > Hi Manuel, Hi Alvaro! > I suggest you look for temp tables that have not been reclaimed. > We've had a couple of reports where leftover temp tables have > stopped the frozen-xid counter from advancing. (They would have a > very old relfrozenxid.) Thank you very much for the suggestion. Any pointers on how to do that? A quick serch in google didn't show anything relevant. Regards, Manuel.
Manuel Sugawara <masm@fciencias.unam.mx> writes: > Alvaro Herrera <alvherre@commandprompt.com> writes: > >> Hi Manuel, > > Hi Alvaro! > >> I suggest you look for temp tables that have not been reclaimed. >> We've had a couple of reports where leftover temp tables have >> stopped the frozen-xid counter from advancing. (They would have a >> very old relfrozenxid.) > > Thank you very much for the suggestion. Any pointers on how to do > that? A quick serch in google didn't show anything relevant. Will look into pg_class, of course. Somehow I was thinking something else. Thanks again. Regards, Manuel.
Alvaro Herrera <alvherre@commandprompt.com> writes: > Manuel Sugawara wrote: > > Hi Manuel, > >> The funny thing is that there was no open transactions, even after >> restarting the cluster the same message was logged. Today, the >> database stopped working as expected: >> >> ERROR: database is shut down to avoid wraparound data loss in database "postgres" >> HINT: Stop the postmaster and use a standalone backend to VACUUM in "postgres" > > I suggest you look for temp tables that have not been reclaimed. We've > had a couple of reports where leftover temp tables have stopped the > frozen-xid counter from advancing. (They would have a very old > relfrozenxid.) In each database executed: select relname, age(relfrozenxid) from pg_class where relnamespace = '10406'::oid; (note that 10406 is the oid of the pg_temp_1 namespace) none of them showed temp tables, Is this the correct way?, Any other idea?. Regards, Manuel.
Manuel Sugawara wrote: > In each database executed: > > select relname, age(relfrozenxid) from pg_class where relnamespace = '10406'::oid; > > (note that 10406 is the oid of the pg_temp_1 namespace) none of them > showed temp tables, Is this the correct way?, Any other idea?. Hmm, nope -- take away the relnamespace check, because there is a different namespace for each backend (pg_temp_2, pg_temp_3, etc). And as far as I've seen, most leftover temp tables are on "high" temp schemas (i.e. those belonging to backends that are only used when the load is high). -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera <alvherre@commandprompt.com> writes: > Hmm, nope -- take away the relnamespace check, because there is a > different namespace for each backend (pg_temp_2, pg_temp_3, etc). And > as far as I've seen, most leftover temp tables are on "high" temp > schemas (i.e. those belonging to backends that are only used when the > load is high). Yeah, because the low-numbered ones get recycled first. To have a temp table survive for long enough to create this problem, it's pretty much got to be in a high-numbered temp schema. regards, tom lane
Alvaro Herrera <alvherre@commandprompt.com> writes: > Hmm, nope -- take away the relnamespace check, because there is a > different namespace for each backend (pg_temp_2, pg_temp_3, etc). Still no luck, changed the query to: select relname, age(relfrozenxid) from pg_class join pg_namespace n on (n.oid = relnamespace) where nspname ~ '^pg_temp' and no temp tables showed in any database :-(. Any other idea?. Regards, Manuel.
Manuel Sugawara wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: > > > Hmm, nope -- take away the relnamespace check, because there is a > > different namespace for each backend (pg_temp_2, pg_temp_3, etc). > > Still no luck, changed the query to: > > select relname, age(relfrozenxid) from pg_class join pg_namespace n on (n.oid = relnamespace) where nspname ~ '^pg_temp' > > and no temp tables showed in any database :-(. Any other idea?. None :-( Is there any table with a large age value, regardless of temp-ness? Does the age of the oldest table correspond to the age of pg_database.datfrozenxid? The interesting database is the one with the largest age(pg_database.datfrozenxid). -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Manuel Sugawara wrote: Going back to your first message I see that I missed something important: > 1: relname = "pg_tablespace" (typeid = 19, len = 64, typmod = -1, byval = f) > 2: age = "2146484675" (typeid = 23, len = 4, typmod = -1, byval = t) > 1: relname = "pg_pltemplate" (typeid = 19, len = 64, typmod = -1, byval = f) > 2: age = "2146484675" (typeid = 23, len = 4, typmod = -1, byval = t) > 1: relname = "pg_shdepend" (typeid = 19, len = 64, typmod = -1, byval = f) > 2: age = "2146484675" (typeid = 23, len = 4, typmod = -1, byval = t) > 1: relname = "pg_shdescription" (typeid = 19, len = 64, typmod = -1, byval = f) > 2: age = "2146484675" (typeid = 23, len = 4, typmod = -1, byval = t) > 1: relname = "pg_database" (typeid = 19, len = 64, typmod = -1, byval = f) > 2: age = "2146484675" (typeid = 23, len = 4, typmod = -1, byval = t) > 1: relname = "pg_auth_members" (typeid = 19, len = 64, typmod = -1, byval = f) > 2: age = "2146484675" (typeid = 23, len = 4, typmod = -1, byval = t) These are all shared catalogs AFAIR. Have you vacuumed those? -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes: > Manuel Sugawara wrote: >> Alvaro Herrera <alvherre@commandprompt.com> writes: >> >> > Hmm, nope -- take away the relnamespace check, because there is a >> > different namespace for each backend (pg_temp_2, pg_temp_3, etc). >> >> Still no luck, changed the query to: >> >> select relname, age(relfrozenxid) from pg_class join pg_namespace n on (n.oid = relnamespace) where nspname ~ '^pg_temp' >> >> and no temp tables showed in any database :-(. Any other idea?. > > None :-( Is there any table with a large age value, regardless of > temp-ness? Does the age of the oldest table correspond to the age of > pg_database.datfrozenxid? The interesting database is the one with the > largest age(pg_database.datfrozenxid). A friend is doing the queries right now but it takes time :-(, mean time I was wondering if it will be safe to apply the following patch just to get the database up and be able to run pg_dumpall: *** postgresql-8.2.6/src/backend/access/transam/varsup.c~ 2006-11-05 16:42:07.000000000 -0600 --- postgresql-8.2.6/src/backend/access/transam/varsup.c 2008-04-08 18:34:51.000000000 -0500 *************** *** 225,231 **** * vacuuming requires one transaction per table cleaned, we had better be * sure there's lots of XIDs left...) */ ! xidStopLimit = xidWrapLimit - 1000000; if (xidStopLimit < FirstNormalTransactionId) xidStopLimit -= FirstNormalTransactionId; --- 225,231 ---- * vacuuming requires one transaction per table cleaned, we had better be * sure there's lots of XIDs left...) */ ! xidStopLimit = xidWrapLimit - 500000; if (xidStopLimit < FirstNormalTransactionId) xidStopLimit -= FirstNormalTransactionId; Regards, Manuel.
Manuel Sugawara wrote: > A friend is doing the queries right now but it takes time :-(, mean > time I was wondering if it will be safe to apply the following patch > just to get the database up and be able to run pg_dumpall: Yes, it is safe. Just make sure to get a copy of the database out in 500k transactions ... -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera <alvherre@commandprompt.com> writes: > Manuel Sugawara wrote: > >> A friend is doing the queries right now but it takes time :-(, mean >> time I was wondering if it will be safe to apply the following patch >> just to get the database up and be able to run pg_dumpall: > > Yes, it is safe. Just make sure to get a copy of the database out in > 500k transactions ... Ok, will do that. Thanks a lot Alvaro. Want me to send more details to debug the problem?. Regards, Manuel.
Manuel Sugawara wrote: > Ok, will do that. Thanks a lot Alvaro. Want me to send more details to > debug the problem?. Yes, it would be good to know why the shared catalogs were not being vacuumed, if you can find that out. I would have guessed that they weren't being vacuumed due to the fact that they can only be vacuumed by a superuser, but in a standalone backend you are always superuser. So if you ran a database-wide vacuum, they should have been processed. What happens if you try "vacuum pg_database", etc? Does the age(relfrozenxid) change? -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.