Thread: PgUpgrade bumped my XIDs by ~50M?

PgUpgrade bumped my XIDs by ~50M?

From
Jerry Sievers
Date:
We have a large >20TB system just pg_upgraded from 9.5 to 9.6 as per the
versions shown below.

The system does <5M transactions/day based on sum(commit + abort) from
pg_stat_database.

Autovac is running all possible threads now and upon investigating I see
that thousands of tables are now above the freeze threshold.  Same
tables all appear ~50M xids older than they did yesterday and the
upgrade was less than 24 hours ago.

I have a "safety" snap made of the system before upgrade that can be
used for inspection.

Any ideas why the age jump?

select age(l.relfrozenxid), l.oid::regclass::text as relation, l.relkind, l.relpages,
       r.oid::regclass::text as "toast for"
from pg_class l
left join pg_class r on l.oid = r.reltoastrelid
where l.relkind in ('r', 't')
order by age desc, relation
limit :limit

ii  postgresql-9.5  9.5.12-1.pgdg16.04+1  amd64  object-relational SQL database, version 9.5 server
ii  postgresql-9.6  9.6.8-1.pgdg16.04+1   amd64  object-relational SQL database, version 9.6 server

Thanks

-- 
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consulting@comcast.net
p: 312.241.7800


Re: PgUpgrade bumped my XIDs by ~50M?

From
Bruce Momjian
Date:
On Wed, Apr  4, 2018 at 05:29:46PM -0500, Jerry Sievers wrote:
> We have a large >20TB system just pg_upgraded from 9.5 to 9.6 as per the
> versions shown below.
> 
> The system does <5M transactions/day based on sum(commit + abort) from
> pg_stat_database.
> 
> Autovac is running all possible threads now and upon investigating I see
> that thousands of tables are now above the freeze threshold.  Same
> tables all appear ~50M xids older than they did yesterday and the
> upgrade was less than 24 hours ago.
> 
> I have a "safety" snap made of the system before upgrade that can be
> used for inspection.
> 
> Any ideas why the age jump?

Uh, you can read how pg_upgrade handles frozen xids in pg_upgrade.c:

    https://doxygen.postgresql.org/pg__upgrade_8c_source.html#l00543

I am not sure what would have caused such a jump.  pg_upgrade brings
over the frozen values for each table, and sets the server's frozen
counter to match the old one.

If you run 'pg_dump --binary-upgrade' you will see the frozen xids being
transferred:

    -- For binary upgrade, set heap's relfrozenxid and relminmxid
    UPDATE pg_catalog.pg_class
    SET relfrozenxid = '558', relminmxid = '1'
    WHERE oid = 'public.test'::pg_catalog.regclass;

Is it possible that pg_upgrade used 50M xids while upgrading?
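
As a sketch only (using 'public.test' as the placeholder table from the
example above), you could spot-check that the values really were carried
over by running the same query on the old and new clusters and comparing:

    SELECT c.oid::regclass AS relation, c.relfrozenxid, c.relminmxid
    FROM pg_class c
    WHERE c.oid = 'public.test'::regclass;

The relfrozenxid values should match exactly on both sides; only
age(relfrozenxid) should differ if the new server's XID counter has
advanced.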

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


Re: PgUpgrade bumped my XIDs by ~50M?

From
Jerry Sievers
Date:
Bruce Momjian <bruce@momjian.us> writes:

> On Wed, Apr  4, 2018 at 05:29:46PM -0500, Jerry Sievers wrote:
>
>> We have a large >20TB system just pg_upgraded from 9.5 to 9.6 as per the
>> versions shown below.
>> 
>> The system does <5M transactions/day based on sum(commit + abort) from
>> pg_stat_database.
>> 
>> Autovac is running all possible threads now and upon investigating I see
>> that thousands of tables are now above the freeze threshold.  Same
>> tables all appear ~50M xids older than they did yesterday and the
>> upgrade was less than 24 hours ago.
>> 
>> I have a "safety" snap made of the system before upgrade that can be
>> used for inspection.
>> 
>> Any ideas why the age jump?
>
> Uh, you can read how pg_upgrade handles frozen xids in pg_upgrade.c:
>
>     https://doxygen.postgresql.org/pg__upgrade_8c_source.html#l00543
>
> I am not sure what would have caused such a jump.  pg_upgrade brings
> over the frozen values for each table, and sets the server's frozen
> counter to match the old one.
>
> If you run 'pg_dump --binary-upgrade' you will see the frozen xids being
> transferred:
>
>     -- For binary upgrade, set heap's relfrozenxid and relminmxid
>     UPDATE pg_catalog.pg_class
>     SET relfrozenxid = '558', relminmxid = '1'
>     WHERE oid = 'public.test'::pg_catalog.regclass;
>
> Is it possible that pg_upgrade used 50M xids while upgrading?

Hi Bruce.

Don't think so, as I did just snap the safety snap and ran another
upgrade on that.

And I also compared txid_current for the upgraded snap and our running
production instance.

The freshly upgraded snap is ~50M txids behind the prod instance.

If this is a not too uncommon case of users running amok, then this time
in particular they really went off the charts :-)

Will investigate...

FYI, this is the same system that a few weeks ago issued complaints
during vacuum about an XID younger than relfrozenxid, which a system
restart mysteriously resolved.

I hope we're going to be OK here.

Thx

--
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consulting@comcast.net
p: 312.241.7800


Re: PgUpgrade bumped my XIDs by ~50M?

From
Bruce Momjian
Date:
On Wed, Apr  4, 2018 at 07:13:36PM -0500, Jerry Sievers wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Is it possible that pg_upgrade used 50M xids while upgrading?
> 
> Hi Bruce.
> 
> Don't think so, as I did just snap the safety snap and ran another
> upgrade on that.
> 
> And I also compared txid_current for the upgraded snap and our running
> production instance.
> 
> The freshly upgraded snap is ~50M txids behind the prod instance.

Are the objects 50M behind or is txid_current 50M different?  Higher or
lower?
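
A rough sketch of how to tell the two apart, run on both the upgraded
snap and production:

    -- Did the XID counter move, or did the tables themselves change?
    SELECT txid_current() AS current_xid;

    SELECT c.oid::regclass AS relation,
           c.relfrozenxid,
           age(c.relfrozenxid) AS xid_age
    FROM pg_class c
    WHERE c.relkind IN ('r', 't')
    ORDER BY age(c.relfrozenxid) DESC
    LIMIT 10;

If relfrozenxid is identical on both sides but age() differs, the XID
counter advanced; note that txid_current() itself assigns an XID when
called.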


> 
> If this is a not too uncommon case of users running amok, then this time
> in particular they really went off the charts :-)

I have never heard of this problem.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


Re: PgUpgrade bumped my XIDs by ~50M?

From
Jerry Sievers
Date:
Bruce Momjian <bruce@momjian.us> writes:

> On Wed, Apr  4, 2018 at 07:13:36PM -0500, Jerry Sievers wrote:
>
>> Bruce Momjian <bruce@momjian.us> writes:
>> > Is it possible that pg_upgrade used 50M xids while upgrading?
>> 
>> Hi Bruce.
>> 
>> Don't think so, as I did just snap the safety snap and ran another
>> upgrade on that.
>> 
>> And I also compared txid_current for the upgraded snap and our running
>> production instance.
>> 
>> The freshly upgraded snap is ~50M txids behind the prod instance.
>
> Are the objects 50M behind or is txid_current 50M different?  Higher or
> lower?

txid_current is another 12M higher than a few hours ago.  Still waiting
to hear from my reporting team if they changed anything.

This thing is running PgLogical and has a few of our event triggers as
well.  But nothing in this regard changed with the upgrade.

What if some very frequent but trivial statements that did not get
assigned a real TXID in 9.5 on this configuration are now being treated
differently?

What's puzzling too is that when I run my TPS monitor script, it's
clicking along at what looks typical, which at present would only amount
to 700k transactions/day, but we're off-peak.
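
Here's a rough sketch of what I could sample periodically to see whether
XIDs are being consumed faster than the commit/abort counters suggest:

    -- Sample a few minutes apart and diff the two columns between samples.
    -- (Read-only transactions are counted here but never consume an XID,
    -- so this is only a rough comparison.)
    SELECT txid_current() AS current_xid,
           (SELECT sum(xact_commit + xact_rollback)
              FROM pg_stat_database) AS counted_xacts,
           now() AS sampled_at;

If current_xid climbs much faster than counted_xacts, something is
burning XIDs outside of ordinary client transactions.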

Thx
>
>
>> 
>> If this is a not too uncommon case of users running amok, then this time
>> in particular they really went off the charts :-)
>
> I have never heard of this problem.

-- 
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consulting@comcast.net
p: 312.241.7800


Re: PgUpgrade bumped my XIDs by ~50M?

From
Jerry Sievers
Date:
Jerry Sievers <gsievers19@comcast.net> writes:

> Bruce Momjian <bruce@momjian.us> writes:
>
>> On Wed, Apr  4, 2018 at 07:13:36PM -0500, Jerry Sievers wrote:
>>
>>> Bruce Momjian <bruce@momjian.us> writes:
>>> > Is it possible that pg_upgrade used 50M xids while upgrading?
>>> 
>>> Hi Bruce.
>>> 
>>> Don't think so, as I did just snap the safety snap and ran another
>>> upgrade on that.
>>> 
>>> And I also compared txid_current for the upgraded snap and our running
>>> production instance.
>>> 
>>> The freshly upgraded snap is ~50M txids behind the prod instance.
>>
>> Are the objects 50M behind or is txid_current 50M different?  Higher or
>> lower?
>
> txid_current is another 12M higher than a few hours ago.  Still waiting
> to hear from my reporting team if they changed anything.

Reporting team claims nothing changed.  

I still have 150 tables queued for autovac based on freeze age alone.
Autovac is running at our configured max worker count of 20, with all
workers busy as expected.
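
For reference, a sketch of one way to list the tables that have crossed
the threshold (ignoring any per-table autovacuum_freeze_max_age
overrides):

    SELECT c.oid::regclass AS relation,
           age(c.relfrozenxid) AS xid_age,
           current_setting('autovacuum_freeze_max_age')::int AS freeze_max_age
    FROM pg_class c
    WHERE c.relkind IN ('r', 't')
      AND age(c.relfrozenxid) > current_setting('autovacuum_freeze_max_age')::int
    ORDER BY xid_age DESC;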

If I can assume stats such as pg_stat_database start out cleared after
an upgrade...

...then note that pg_stat_database shows about the number of
transactions I'd expect for this system over a ~1.5 day period.

So how have some objects apparently aged by ~100M transactions (as of
the last check) since the upgrade??

Thanks



postgres=# select sum(xact_rollback+ xact_commit) from pg_stat_database;
   sum   
---------
 5292417
(1 row)

postgres=# select now() - pg_postmaster_start_time();
       ?column?        
-----------------------
 1 day 13:18:48.721896
(1 row)

postgres=# select version();
                                                             version
----------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.6.8 on x86_64-pc-linux-gnu (Ubuntu 9.6.8-1.pgdg16.04+1), compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609, 64-bit
(1 row)


>
> This thing is running PgLogical and has a few of our event triggers as
> well.  But nothing in this regard changed with the upgrade.
>
> What if some very frequent but trivial statements that did not get
> assigned a real TXID in 9.5 on this configuration are now being treated
> differently?
>
> What's puzzling too is that when I run my TPS monitor script, it's
> clicking along at what looks typical, which at present would only amount
> to 700k transactions/day, but we're off-peak.
>
> Thx
>>
>>
>>> 
>>> If this is a not too uncommon case of users running amok, then this time
>>> in particular they really went off the charts :-)
>>
>> I have never heard of this problem.

-- 
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consulting@comcast.net
p: 312.241.7800


Re: PgUpgrade bumped my XIDs by ~50M?

From
Bruce Momjian
Date:
On Wed, Apr  4, 2018 at 08:29:06PM -0400, Bruce Momjian wrote:
> On Wed, Apr  4, 2018 at 07:13:36PM -0500, Jerry Sievers wrote:
> > Bruce Momjian <bruce@momjian.us> writes:
> > > Is it possible that pg_upgrade used 50M xids while upgrading?
> > 
> > Hi Bruce.
> > 
> > Don't think so, as I did just snap the safety snap and ran another
> > upgrade on that.
> > 
> > And I also compared txid_current for the upgraded snap and our running
> > production instance.
> > 
> > The freshly upgraded snap is ~50M txids behind the prod instance.
> 
> Are the objects 50M behind or is txid_current 50M different?  Higher or
> lower?

Uh, here is a report of a similar problem from March 6, 2018:


https://www.postgresql.org/message-id/flat/C44C73BC-6B3A-42E0-9E44-6CE4E5B5D601%40ebureau.com#C44C73BC-6B3A-42E0-9E44-6CE4E5B5D601@ebureau.com

    I upgraded a very large database from 9.6 to 10.1 via pg_upgrade
    recently, and ever since, the auto vacuum has been busy on a large
    legacy table that has experienced no changes since the upgrade. If the
    whole table had been frozen prior to the upgrade, would you expect it to
    stay frozen? 

It sure smells like we have a bug here.  Could this be statistics
collection instead?
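
If it helps, a sketch to see what autovacuum/autoanalyze has actually
been touching recently (pg_stat_user_tables only covers the current
database):

    SELECT relname, last_vacuum, last_autovacuum,
           last_analyze, last_autoanalyze
    FROM pg_stat_user_tables
    ORDER BY greatest(last_autovacuum, last_autoanalyze) DESC NULLS LAST
    LIMIT 20;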

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


Re: PgUpgrade bumped my XIDs by ~50M?

From
Jerry Sievers
Date:
Bruce Momjian <bruce@momjian.us> writes:

> On Wed, Apr  4, 2018 at 08:29:06PM -0400, Bruce Momjian wrote:
>
>> On Wed, Apr  4, 2018 at 07:13:36PM -0500, Jerry Sievers wrote:
>> > Bruce Momjian <bruce@momjian.us> writes:
>> > > Is it possible that pg_upgrade used 50M xids while upgrading?
>> > 
>> > Hi Bruce.
>> > 
>> > Don't think so, as I did just snap the safety snap and ran another
>> > upgrade on that.
>> > 
>> > And I also compared txid_current for the upgraded snap and our running
>> > production instance.
>> > 
>> > The freshly upgraded snap is ~50M txids behind the prod instance.
>> 
>> Are the objects 50M behind or is txid_current 50M different?  Higher or
>> lower?
>
> Uh, here is a report of a similar problem from March 6, 2018:
>
>
https://www.postgresql.org/message-id/flat/C44C73BC-6B3A-42E0-9E44-6CE4E5B5D601%40ebureau.com#C44C73BC-6B3A-42E0-9E44-6CE4E5B5D601@ebureau.com
>
>     I upgraded a very large database from 9.6 to 10.1 via pg_upgrade
>     recently, and ever since, the auto vacuum has been busy on a large
>     legacy table that has experienced no changes since the upgrade. If the
>     whole table had been frozen prior to the upgrade, would you expect it to
>     stay frozen? 
>
> It sure smells like we have a bug here.  Could this be statistics
> collection instead?

No clue, but we still have the 9.5 safety snap that I can make repeated
sub-snaps of for more testing.

I did one such test yesterday and things looked $normal after the
upgrade.

One noteworthy omission: I did *not* run the post-analysis.

I could repeat the test more thoroughly.

We run a parallel analyzer framework to get huge DBs processed quickly.

And we generally do it in 2 passes: the first with default_statistics_target
scaled down, to finish quicker and hopefully be good enough to run
production.  At that point we open the DB cluster for business.

Immediately afterward we run the analyzer again at the full stats target.
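
For what it's worth, a single-session sketch of the two-pass idea (the
real framework runs many sessions in parallel, and the target values
here are just placeholders):

    -- Pass 1: low-resolution stats so the cluster can open for business quickly.
    SET default_statistics_target = 10;
    ANALYZE;

    -- Pass 2: full-resolution stats once traffic is already flowing.
    SET default_statistics_target = 100;
    ANALYZE;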

Any other suggestions on what else to look at?  I'll be glad to check
and report back.

Big Thanks!

-- 
Jerry Sievers
e: jerry.sievers@comcast.net
p: 312.241.7800