Re: pg_upgrade and wraparound - Mailing list pgsql-general

From Adrian Klaver
Subject Re: pg_upgrade and wraparound
Date
Msg-id 651b245a-8874-0cd4-827c-a3691dbda6eb@aklaver.com
Whole thread Raw
In response to Re: pg_upgrade and wraparound  (Alexander Shutyaev <shutyaev@gmail.com>)
Responses Re: pg_upgrade and wraparound  (Alexander Shutyaev <shutyaev@gmail.com>)
List pgsql-general
On 06/11/2018 11:32 AM, Alexander Shutyaev wrote:
> I'm back with more details.
> 
> First, I've deleted the smaller sslentry database, since I don't need 
> it, just so that it doesn't somehow spoil the picture. Now there is only 
> 1 user database - bof (OID=16400). After that I've ran the pg_upgrade on 
> a clean 10.4 cluster and it failed in the same way.
> 
> Now, the answers to your queries.
> 
>  >> 2) The upgrade stops because of transaction ID wraparound, which is 
> strange as that is not showing up in the 9.6 cluster I presume. You 
> might want the queries found below on the 9.6 and 10 clusters to help 
> figure this out:
> 
>  >> SELECT c.oid::regclass as table_name,
>  >>        greatest(age(c.relfrozenxid),age(t.relfrozenxid)) as age
>  >> FROM pg_class c
>  >> LEFT JOIN pg_class t ON c.reltoastrelid = t.oid
>  >> WHERE c.relkind IN ('r', 'm');
> 
> 9.6 result - see attached 96-query1.txt
> 10.4 result - see attached 104-query1.txt

I should have been clearer in my previous post, the above query is per 
database. From the query below I am going to say the above query was 
done on the bof database. Is that correct?

Given the below from 96-query2.txt:

template0 | 110588398

Can you run the table_name query in template0 in the 9.6 cluster?


> 
>  >> SELECT datname, age(datfrozenxid) FROM pg_database;
> 
> 9.6 result - see attached 96-query2.txt
> 10.4 result - see attached 104-query2.txt
> 
>  >>  It might be useful
>  >> to see pg_controldata output for the old cluster, as well as
>  >> "select datname, datfrozenxid from pg_database" output from the
>  >> old cluster.
> 
> for the query - see above, for pg_controldata:
> 
> 9.6 - see attached 96-pg_controldata.txt
> 10.4 - see attached 104-pg_controldata.txt
> 
>  >> Alexander, could you hack things up so autovacuum logging is enabled
>  >> (log_autovacuum_min_duration=0), and see whether it's triggered?
> 
> I'll be happy to, but that will require to run pg_upgrade once more and 
> that takes more that half a day and during this time clusters are not 
> available to me. Given the data I'm attaching it may happen that the 
> colleagues will want to see something else from my clusters or maybe 
> change some settings before running the pg_upgrade again. Therefore, 
> I'll wait 12 hours after this message in case there will be any more 
> requests and the I'll run the pg_upgrade again.
> 
> Thank you all for trying to solve this matter, this is much appreciated! :)
> 
> 2018-06-11 20:29 GMT+03:00 Andres Freund <andres@anarazel.de 
> <mailto:andres@anarazel.de>>:
> 
>     On 2018-06-11 13:14:12 -0400, Tom Lane wrote:
>     > Andres Freund <andres@anarazel.de <mailto:andres@anarazel.de>> writes:
>     > > I suspect the issue is that pg_resetwal does:
>     > >     if (set_xid != 0)
>     > >     {
>     > >             ControlFile.checkPointCopy.nextXid = set_xid;
>     > 
>     > >             /*
>     > >              * For the moment, just set oldestXid to a value that will force
>     > >              * immediate autovacuum-for-wraparound.  It's not clear whether adding
>     > >              * user control of this is useful, so let's just do something that's
>     > >              * reasonably safe.  The magic constant here corresponds to the
>     > >              * maximum allowed value of autovacuum_freeze_max_age.
>     > >              */
>     > >             ControlFile.checkPointCopy.oldestXid = set_xid - 2000000000;
>     > >             if (ControlFile.checkPointCopy.oldestXid < FirstNormalTransactionId)
>     > >                     ControlFile.checkPointCopy.oldestXid += FirstNormalTransactionId;
>     > >             ControlFile.checkPointCopy.oldestXidDB = InvalidOid;
>     > >     }
>     > 
>     > > but we have codepath that doesn't check for oldestXidDB being
>     > > InvalidOid.  Not great.
>     > 
>     > Hm, I think I'd define the problem as "pg_resetwal is violating the
>     > expectation that oldestXidDB be valid".
> 
>     Well, what could it do otherwise?  ForceTransactionIdLimitUpdate()
>     currently does a syscache check for database existence. That'll just
>     return a lookup failure for InvalidOid, so we're reasonably good on that
>     front.
> 
>     Using a hardcoded 2000000000 seems worse, will have funny results if
>     running with a smaller autovacuum_freeze_max_age...
> 
> 
>     > However, this just explains the basically-cosmetic issue that the
>     > complaint message mentions OID 0.  It doesn't really get us to the
>     > answer to why Alexander is seeing a failure.  It might be useful
>     > to see pg_controldata output for the old cluster, as well as
>     > "select datname, datfrozenxid from pg_database" output from the
>     > old cluster.
> 
>     pg_upgrade starts the server with autovacuum disabled, I suspect
>     restoring all the large objects ends up using a lot of transaction
>     ids. GetNewTransactionId() should start autovacuum, but I'd guess that's
>     where things are going wrong for some reason.
> 
>     Alexander, could you hack things up so autovacuum logging is enabled
>     (log_autovacuum_min_duration=0), and see whether it's triggered?
> 
>     I'm not entirely clear why pg_restore appears to use a separate
>     transaction for each large object, surely exascerbating the problem.
> 
>     Greetings,
> 
>     Andres Freund
> 
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com


pgsql-general by date:

Previous
From: Alexander Shutyaev
Date:
Subject: Re: pg_upgrade and wraparound
Next
From: Alexander Shutyaev
Date:
Subject: Re: pg_upgrade and wraparound