Re: Could not finish anti-wraparound VACUUM when stop limit is reached - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: Could not finish anti-wraparound VACUUM when stop limit is reached |
Date | |
Msg-id | 53906D68.3000706@vmware.com Whole thread Raw |
In response to | Re: Could not finish anti-wraparound VACUUM when stop limit is reached (Jeff Janes <jeff.janes@gmail.com>) |
List | pgsql-hackers |
On 06/04/2014 12:45 AM, Jeff Janes wrote: > On Sun, May 25, 2014 at 8:40 AM, Heikki Linnakangas <hlinnakangas@vmware.com >> wrote: > >> While debugging the B-tree bug that Jeff Janes reported ( >> http://www.postgresql.org/message-id/CAMkU=1y=VwF07Ay+Cpqk_ >> 7FpiHRctmssV9y99SBGhitkXPbf8g@mail.gmail.com), a new issue came up: >> >> If you reach the xidStopLimit, and try to run VACUUM, it fails with error: >> >> jjanes=# vacuum; >> ERROR: database is not accepting commands to avoid wraparound data loss >> in database "jjanes" >> HINT: Stop the postmaster and vacuum that database in single-user mode. >> You might also need to commit or roll back old prepared transactions. >> >> The backtrace looks like this: >> >> #0 errstart (elevel=20, filename=0x9590a0 "varsup.c", lineno=120, >> funcname=0x9593f0 <__func__.10455> "GetNewTransactionId", domain=0x0) >> at elog.c:249 >> #1 0x00000000004f7c14 in GetNewTransactionId (isSubXact=0 '\000') at >> varsup.c:115 >> #2 0x00000000004f86db in AssignTransactionId (s=0xd62900 >> <TopTransactionStateData>) >> at xact.c:510 >> #3 0x00000000004f84a4 in GetCurrentTransactionId () at xact.c:382 >> #4 0x000000000062dc1c in vac_truncate_clog (frozenXID=1493663893, >> minMulti=1) >> at vacuum.c:909 >> #5 0x000000000062dc06 in vac_update_datfrozenxid () at vacuum.c:888 >> #6 0x000000000062cdf6 in vacuum (vacstmt=0x29e05e0, relid=0, do_toast=1 >> '\001', >> bstrategy=0x2a5cc38, for_wraparound=0 '\000', isTopLevel=1 '\001') at >> vacuum.c:294 >> #7 0x00000000007a3c55 in standard_ProcessUtility (parsetree=0x29e05e0, >> queryString=0x29dfbf8 "vacuum ;", context=PROCESS_UTILITY_TOPLEVEL, >> params=0x0, >> dest=0x29e0988, completionTag=0x7fff9411a490 "") at utility.c:645 >> >> So, vac_truncate_clog() tries to get a new transaction ID, which fails >> because we've already reached the stop-limit. vac_truncate_clog() doesn't >> really need a new XID to be assigned, though, it only uses it to compare >> against datfrozenxid to see if wrap-around has already happened, so it >> could use ReadNewTransactionId() instead. >> > > > Like the attached patch, or does it need to be more complicated than that? Yeah, that's what I had in mind. However, I just realized that there's a little race condition: 1. vac_truncate_clog() calls ReadNextTransactionId(), and then for some reason goes to sleep for a long time. 2. Another backend runs VACUUM FREEZE, and advances datfrozenxid 3. vac_truncate_clog() continues, finds that the new datfrozenxid < myXID, and gives the warning that wrap-around has already happened. The fact that GetCurrentTransaction() sets MyProc->xid is crucial to preventing that race condition today. It prevents the other backend from advancing datfrozenxid to a value greater than MyXID. So what we'd really need is a version of ReadNewTransactionId() that sets also MyProc->xmin. Or we could call GetSnapshotData() first. >> Jeff's database seems to have wrapped around already, because after fixing >> the above, I get this: >> >> jjanes=# vacuum; >> WARNING: some databases have not been vacuumed in over 2 billion >> transactions >> DETAIL: You might have already suffered transaction-wraparound data loss. >> VACUUM >> > >> We do not truncate clog when wraparound has already happened, so we never >> get past that point. Jeff advanced XID counter aggressively with some >> custom C code, so hitting the actual wrap-around is a case of "don't do >> that". Still, the case is quite peculiar: pg_controldata says that nextXid >> is 4/1593661139. The oldest datfrozenxid is equal to that, 1593661139. So >> ISTM he managed to not just wrap around, but execute 2 billion more >> transactions after the wraparound and reach datfrozenxid again. I'm not >> sure how that happened. > > > I applied the attached patch to current git head (0ad1a816320a2b539a5) and > use it to start up a copy of the indicated data directory in ordinary (not > single user) mode after deleting postmaster.opts. Autovacuum starts, > eventually completes, and then the database becomes usable again. > > Perhap while investigating the matter you did something that pushed it over > the edge. Could you try it again from a fresh untarring of the data > directory and see if you can reproduce my results? I was not able to reproduce that error anymore, so I think we're good. I think I used ReadNewTransactionId() - 1 in my earlier tests, which was wrong. If there are no transactions running, datfrozenxid is set to the *next* XID that will be assigned, which is smaller than ReadNewTransactionId() - 1, so the warning was bogus. (investigating this lead me to realize the race condition mentioned above) - Heikki
pgsql-hackers by date: