Re: Could not finish anti-wraparound VACUUM when stop limit is reached - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Could not finish anti-wraparound VACUUM when stop limit is reached
Date
Msg-id 53906D68.3000706@vmware.com
Whole thread Raw
In response to Re: Could not finish anti-wraparound VACUUM when stop limit is reached  (Jeff Janes <jeff.janes@gmail.com>)
List pgsql-hackers
On 06/04/2014 12:45 AM, Jeff Janes wrote:
> On Sun, May 25, 2014 at 8:40 AM, Heikki Linnakangas <hlinnakangas@vmware.com
>> wrote:
>
>> While debugging the B-tree bug that Jeff Janes reported (
>> http://www.postgresql.org/message-id/CAMkU=1y=VwF07Ay+Cpqk_
>> 7FpiHRctmssV9y99SBGhitkXPbf8g@mail.gmail.com), a new issue came up:
>>
>> If you reach the xidStopLimit, and try to run VACUUM, it fails with error:
>>
>> jjanes=# vacuum;
>> ERROR:  database is not accepting commands to avoid wraparound data loss
>> in database "jjanes"
>> HINT:  Stop the postmaster and vacuum that database in single-user mode.
>> You might also need to commit or roll back old prepared transactions.
>>
>> The backtrace looks like this:
>>
>> #0  errstart (elevel=20, filename=0x9590a0 "varsup.c", lineno=120,
>>      funcname=0x9593f0 <__func__.10455> "GetNewTransactionId", domain=0x0)
>> at elog.c:249
>> #1  0x00000000004f7c14 in GetNewTransactionId (isSubXact=0 '\000') at
>> varsup.c:115
>> #2  0x00000000004f86db in AssignTransactionId (s=0xd62900
>> <TopTransactionStateData>)
>>      at xact.c:510
>> #3  0x00000000004f84a4 in GetCurrentTransactionId () at xact.c:382
>> #4  0x000000000062dc1c in vac_truncate_clog (frozenXID=1493663893,
>> minMulti=1)
>>      at vacuum.c:909
>> #5  0x000000000062dc06 in vac_update_datfrozenxid () at vacuum.c:888
>> #6  0x000000000062cdf6 in vacuum (vacstmt=0x29e05e0, relid=0, do_toast=1
>> '\001',
>>      bstrategy=0x2a5cc38, for_wraparound=0 '\000', isTopLevel=1 '\001') at
>> vacuum.c:294
>> #7  0x00000000007a3c55 in standard_ProcessUtility (parsetree=0x29e05e0,
>>      queryString=0x29dfbf8 "vacuum ;", context=PROCESS_UTILITY_TOPLEVEL,
>> params=0x0,
>>      dest=0x29e0988, completionTag=0x7fff9411a490 "") at utility.c:645
>>
>> So, vac_truncate_clog() tries to get a new transaction ID, which fails
>> because we've already reached the stop-limit. vac_truncate_clog() doesn't
>> really need a new XID to be assigned, though, it only uses it to compare
>> against datfrozenxid to see if wrap-around has already happened, so it
>> could use ReadNewTransactionId() instead.
>>
>
>
> Like the attached patch, or does it need to be more complicated than that?

Yeah, that's what I had in mind.

However, I just realized that there's a little race condition:

1. vac_truncate_clog() calls ReadNextTransactionId(), and then for some 
reason goes to sleep for a long time.
2. Another backend runs VACUUM FREEZE, and advances datfrozenxid
3. vac_truncate_clog() continues, finds that the new datfrozenxid < 
myXID, and gives the warning that wrap-around has already happened.

The fact that GetCurrentTransaction() sets MyProc->xid is crucial to 
preventing that race condition today. It prevents the other backend from 
advancing datfrozenxid to a value greater than MyXID.

So what we'd really need is a version of ReadNewTransactionId() that 
sets also MyProc->xmin. Or we could call GetSnapshotData() first.

>> Jeff's database seems to have wrapped around already, because after fixing
>> the above, I get this:
>>
>> jjanes=# vacuum;
>> WARNING:  some databases have not been vacuumed in over 2 billion
>> transactions
>> DETAIL:  You might have already suffered transaction-wraparound data loss.
>> VACUUM
>>
>
>> We do not truncate clog when wraparound has already happened, so we never
>> get past that point. Jeff advanced XID counter aggressively with some
>> custom C code, so hitting the actual wrap-around is a case of "don't do
>> that". Still, the case is quite peculiar: pg_controldata says that nextXid
>> is 4/1593661139. The oldest datfrozenxid is equal to that, 1593661139. So
>> ISTM he managed to not just wrap around, but execute 2 billion more
>> transactions after the wraparound and reach datfrozenxid again. I'm not
>> sure how that happened.
>
>
> I applied the attached patch to current git head (0ad1a816320a2b539a5) and
> use it to start up a copy of the indicated data directory in ordinary (not
> single user) mode after deleting postmaster.opts.  Autovacuum starts,
> eventually completes, and then the database becomes usable again.
>
> Perhap while investigating the matter you did something that pushed it over
> the edge.  Could you try it again from a fresh untarring of the data
> directory and see if you can reproduce my results?

I was not able to reproduce that error anymore, so I think we're good. I 
think I used ReadNewTransactionId() - 1 in my earlier tests, which was 
wrong. If there are no transactions running, datfrozenxid is set to the 
*next* XID that will be assigned, which is smaller than 
ReadNewTransactionId() - 1, so the warning was bogus. (investigating 
this lead me to realize the race condition mentioned above)

- Heikki



pgsql-hackers by date:

Previous
From: Linos
Date:
Subject: Re: performance regression in 9.2/9.3
Next
From: Merlin Moncure
Date:
Subject: Re: performance regression in 9.2/9.3