Thread: XID wraparound with huge pg_largeobject

XID wraparound with huge pg_largeobject

From
David Kensiski
Date:
I am working with a client who has a 9.1 database rapidly approaching XID wraparound.  They also have an exceedingly large pg_largeobject table (4217 GB) that has never been vacuumed.  An attempt to vacuum this on a replica has run for days and never succeeded.  (Or more accurately, never been allowed to succeed because we needed to get the replica back on-line.)

Are there creative ways to do such a vacuum with minimal impact on production?  Even if I let the vacuum complete on the replica, I don't think I can play accrued logs from the master, can I?  Or is there some trick to doing so?

I explored using slony and was all excited until I discovered it won't replicate pg_largeobject because it cannot create triggers on the table.

I started looking into the pg_rewind contrib in 9.5, but it plays back xlogs to revert so would suffer the same problem as the replica.

Any other ideas about how we can do this?

Thanks!
--Dave


Re: XID wraparound with huge pg_largeobject

From
Jeff Janes
Date:
On Mon, Nov 30, 2015 at 9:58 AM, David Kensiski <David@kensiski.org> wrote:
> I am working with a client who has a 9.1 database rapidly approaching XID
> wraparound.

The hard limit at 2 billion, or the soft limit at autovacuum_freeze_max_age?
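
(Checking is cheap, by the way -- a quick sketch, run from any
superuser psql session:)

-- how close each database is to the 2-billion hard limit
SELECT datname, age(datfrozenxid) AS xid_age
  FROM pg_database
 ORDER BY age(datfrozenxid) DESC;

-- and the soft limit that triggers anti-wraparound autovacuum
SHOW autovacuum_freeze_max_age;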


> They also have an exceedingly large pg_largeobject table (4217
> GB) that has never been vacuumed.  An attempt to vacuum this on a replica
> has run for days and never succeeded.

What was slowing it down? Reading? Writing? CPU? fdatasync? Locks?
Was it run with throttling (e.g. nonzero vacuum_cost_delay) or
without?

What is the throughput available on your RAID?

> Are there creative ways to do such a vacuum with minimal impact on
> production?  Even if I let the vacuum complete on the replica, I don't think
> I can play accrued logs from the master, can I?

No.  And even if you could replay the logs, I doubt the impact would
be much different from just running the vacuum freeze on the master
directly.  You just need to bite the bullet.

At some point you need to read the entire table in one session, even
if that means scheduling some downtime (or degraded performance time)
in order to do it.  It will also need to rewrite the entire table, but
if there are "vacuum freeze" attempts that don't run to
completion, their partial work will lessen the amount of writing (but
not reading) the ultimately successful vacuum will need to do.  So
start a vacuum freeze now; if you end up needing to cancel it, at
least part of its work will not go to waste.
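
Roughly this, on the master (just a sketch -- VERBOSE so you can watch
progress, and you can check the table's age from another session):

VACUUM FREEZE VERBOSE pg_largeobject;

-- the age should drop to something small once the freeze completes
SELECT relname, age(relfrozenxid)
  FROM pg_class
 WHERE relname = 'pg_largeobject';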


Cheers,

Jeff


Re: XID wraparound with huge pg_largeobject

From
Roxanne Reid-Bennett
Date:
On 11/30/2015 9:58 AM, David Kensiski wrote:
I am working with a client who has a 9.1 database rapidly approaching XID wraparound.  They also have an exceedingly large pg_largeobject table (4217 GB) that has never been vacuumed.  An attempt to vacuum this on a replica has run for days and never succeeded.  (Or more accurately, never been allowed to succeed because we needed to get the replica back on-line.)
...
Any other ideas about how we can do this?

David,

My gut reaction was maybe dump/restore ... but it's pg_largeobject.  I have read the list for years and my memory tells me that it is a problem child in that arena (e.g. as you found out w/ Slony...), and at 4000 GB, not something that can be sandboxed very well. 

Because it's v9.1... and you hadn't gotten any responses (until Jeff)... and I had the time... I did some digging in the archives...

The most promising alternate idea... Last February, Adam Hooper was migrating to SSD and Bill Moran suggested trying to CLUSTER pg_largeobject instead of VACUUM FULL.  (full topic: 2/3/2015, entitled "VACUUM FULL pg_largeobject without (much) downtime?")

CLUSTER has been referenced in the list other times to collapse unused space.... (sometime in 2010-2011):

As a last resort this week, I'm going to get 500+GB of extra file store
added, add a tablespace and move pg_largeobjects to this area. Then use
CLUSTER to rebuild pg_largeobjects back in the default tablespace. This
should fix things I hope, and if needed I'll use Cluster regularly.

It's "an" other idea...  I've no idea whether it will work any better than biting the bullet and just running VACUUM FULL.
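
If someone does go that route, the bare command would look something like this (just a sketch... CLUSTER takes an ACCESS EXCLUSIVE lock and needs room for a complete copy of the table and its index while it rewrites):

-- rewrite pg_largeobject in (loid, pageno) order, dropping dead space
CLUSTER pg_largeobject USING pg_largeobject_loid_pn_index;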

other bits and pieces...

In 2010, Tom suggested REINDEX then VACUUM on pg_largeobject for an 8.? system.  That piqued my interest because we found with 9.1 that weekly reindexing helped with performance.  However, the person who used it didn't find any performance improvement with his VACUUM.  I think reindexing was added to VACUUM FULL in the 9.0 release (but would have to search the release notes to find it).

I remember reading somewhere during this (but can't find the reference <sigh>) that an interrupted VACUUM FREEZE does capture "some" data, so multiples of those actually incrementally improves the speed of the next - but again I can't find the reference, so I've no idea who, when, version, and whether my memory is faulty or misapplied.

There are miscellaneous improvements to the actual running of VACUUM FULL (and, more often, autovacuum) suggested through tweaking the vacuum parameters, "vacuum_cost_delay" being a high-priority target.  Jeff's questions all point at identifying any limitations that are costing you time due to configuration.
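
e.g. something along these lines to see what throttling is in effect before kicking off a manual run (sketch):

-- current throttling / memory settings that matter for a big vacuum
SELECT name, setting, source
  FROM pg_settings
 WHERE name IN ('vacuum_cost_delay', 'vacuum_cost_limit',
                'autovacuum_vacuum_cost_delay', 'autovacuum_vacuum_cost_limit',
                'maintenance_work_mem');

-- for a one-off manual VACUUM FREEZE, throttling can be turned off
-- for just that session
SET vacuum_cost_delay = 0;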

Totally not part of this specific problem... You have run or know of vacuumlo for deleting orphaned LOs...?  Might be worth running it before you collect your free space.  [just in case you didn't or hadn't... twice the bang, half the pain - but only if you do it before collecting your free space]
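
A rough way to gauge how much vacuumlo might reclaim -- "documents(doc_lo oid)" here is a made-up stand-in for whatever application column actually references the LOs (vacuumlo itself walks every oid/lo column for you):

-- total LOs vs. ones still referenced; the gap is roughly what
-- vacuumlo would unlink
SELECT (SELECT count(*) FROM pg_largeobject_metadata) AS total_los,
       (SELECT count(DISTINCT doc_lo) FROM documents) AS referenced_los;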

Roxanne


Re: XID wraparound with huge pg_largeobject

From
David Kensiski
Date:


On Tue, Dec 1, 2015 at 9:12 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Mon, Nov 30, 2015 at 9:58 AM, David Kensiski <David@kensiski.org> wrote:
> I am working with a client who has a 9.1 database rapidly approaching XID
> wraparound.

The hard limit at 2 billion, or the soft limit at autovacuum_freeze_max_age?

We're approaching the hard limit -- we are about to break 1.5 billion.


> They also have an exceedingly large pg_largeobject table (4217
> GB) that has never been vacuumed.  An attempt to vacuum this on a replica
> has run for days and never succeeded.

What was slowing it down? Reading? Writing? CPU? fdatasync? Locks?
Was it run with throttling (e.g. nonzero vacuum_cost_delay) or
without?

I just talked to my colleague who had tested it and it just stopped doing anything.  No cpu, no disk i/o, no apparent activity.  No bueno.
 

What is the throughput available on your RAID?

It's 6 drives in a RAID 10 configuration, so striped across three Seagate Barracuda drives.  Theoretically we should be able to get as much as 18 Gb/s, actual mileage may vary.


> Are there creative ways to do such a vacuum with minimal impact on
> production?  Even if I let the vacuum complete on the replica, I don't think
> I can play accrued logs from the master, can I?

No.  And even if you could replay the logs, I doubt the impact would
be much different from just running the vacuum freeze on the master
directly.  You just need to bite the bullet.

At some point you need to read the entire table in one session, even
if that means scheduling some downtime (or degraded performance time)
in order to do it.  It will also need to rewrite the entire table, but
if there are "vacuum freeze" attempts that don't run to
completion, their partial work will lessen the amount of writing (but
not reading) the ultimately successful vacuum will need to do.  So
start a vacuum freeze now; if you end up needing to cancel it, at
least part of its work will not go to waste.

Unpleasant, but if that's what we have to do, we have to do it.  :-(

--Dave

Re: XID wraparound with huge pg_largeobject

From
David Kensiski
Date:
On Tue, Dec 1, 2015 at 1:48 PM, Roxanne Reid-Bennett <rox@tara-lu.com> wrote:
On 11/30/2015 9:58 AM, David Kensiski wrote:
I am working with a client who has a 9.1 database rapidly approaching XID wraparound.  They also have an exceedingly large pg_largeobject table (4217 GB) that has never been vacuumed.  An attempt to vacuum this on a replica has run for days and never succeeded.  (Or more accurately, never been allowed to succeed because we needed to get the replica back on-line.)
...
Any other ideas about how we can do this?

David,

My gut reaction was maybe dump/restore ... but it's pg_largeobject.  I have read the list for years and my memory tells me that it is a problem child in that arena (e.g. as you found out w/ Slony...), and at 4000 GB, not something that can be sandboxed very well. 

I attempted to dump the qa system and pipe it to restore on a test server and it crashed qa.  Doesn't a dump run as a single transaction, so a long running dump on a busy database accrues more and more resources that don't get released until it completes?  Or am I missing something?

Either way, it means several days of downtime given the size of the database.  Not really practical.


Because it's v9.1... and you hadn't gotten any responses (until Jeff)... and I had the time... I did some digging in the archives...

The most promising alternate idea... Last February, Adam Hooper was migrating to SSD and Bill Moran suggested trying to CLUSTER pg_largeobject instead of VACUUM FULL.  (full topic: 2/3/2015, entitled "VACUUM FULL pg_largeobject without (much) downtime?")

CLUSTER has been referenced in the list other times to collapse unused space.... (sometime in 2010-2011):

As a last resort this week, I'm going to get 500+GB of extra file store
added, add a tablespace and move pg_largeobjects to this area. Then use
CLUSTER to rebuild pg_largeobjects back in the default tablespace. This
should fix things I hope, and if needed I'll use Cluster regularly.

This is intriguing and is worth testing to see if it would work.  Getting the space to do it on production might be tough since all drive slots are full and we're already using 3 TB drives.  But worth trying in qa and if it works, then I can get creative for prod.

 
It's "an" other idea...  I've no idea whether it will work any better than biting the bullet and just running VACUUM FULL.

other bits and pieces...

In 2010, Tom suggested REINDEX then VACUUM on pg_largeobject for an 8.? system.  That piqued my interest because we found with 9.1 that weekly reindexing helped with performance.  However, the person who used it didn't find any performance improvement with his VACUUM.  I think reindexing was added to VACUUM FULL in the 9.0 release (but would have to search the release notes to find it).

I remember reading somewhere during this (but can't find the reference <sigh>) that an interrupted VACUUM FREEZE does capture "some" data, so multiples of those actually incrementally improves the speed of the next - but again I can't find the reference, so I've no idea who, when, version, and whether my memory is faulty or misapplied.

I've seen this as well, and Jeff alluded to it in his post.  Subsequent vacuums will run faster after a partial vacuum since some of the data has already been processed.
 

There are miscellaneous improvements to the actual running of VACUUM FULL (and, more often, autovacuum) suggested through tweaking the vacuum parameters, "vacuum_cost_delay" being a high-priority target.  Jeff's questions all point at identifying any limitations that are costing you time due to configuration.

Definitely be tweaking the parameters to get the best performance if we have to go the vacuum route.
 

Totally not part of this specific problem... You have run or know of vacuumlo for deleting orphaned LOs...?  Might be worth running it before you collect your free space.  [just in case you didn't or hadn't... twice the bang, half the pain - but only if you do it before collecting your free space]

I don't think anyone has tried to vacuumlo, but another one I can test in qa.  If it improves vacuum full performance it might be worth doing it in two phases.

Thanks to both of you for your input!  (I forgot to thank Jeff in my reply to him.)

--Dave


Re: XID wraparound with huge pg_largeobject

From
CS DBA
Date:


On 12/02/2015 09:36 AM, David Kensiski wrote:
On Tue, Dec 1, 2015 at 1:48 PM, Roxanne Reid-Bennett <rox@tara-lu.com> wrote:
On 11/30/2015 9:58 AM, David Kensiski wrote:
I am working with a client who has a 9.1 database rapidly approaching XID wraparound.  They also have an exceedingly large pg_largeobject table (4217 GB) that has never been vacuumed.  An attempt to vacuum this on a replica has run for days and never succeeded.  (Or more accurately, never been allowed to succeed because we needed to get the replica back on-line.)
...
Any other ideas about how we can do this?

David,

My gut reaction was maybe dump/restore ... but it's pg_largeobject.  I have read the list for years and my memory tells me that it is a problem child in that arena (e.g. as you found out w/ Slony...), and at 4000 GB, not something that can be sandboxed very well. 

I attempted to dump the qa system and pipe it to restore on a test server and it crashed qa.  Doesn't a dump run as a single transaction, so a long running dump on a busy database accrues more and more resources that don't get released until it completes?  Or am I missing something?

Either way, it means several days of downtime given the size of the database.  Not really practical.


Because it's v9.1... and you hadn't gotten any responses (until Jeff)... and I had the time... I did some digging in the archives...

The most promising alternate idea... Last February, Adam Hooper was migrating to SSD and Bill Moran suggested trying to CLUSTER pg_largeobject instead of VACUUM FULL.  (full topic: 2/3/2015, entitled "VACUUM FULL pg_largeobject without (much) downtime?")

CLUSTER has been referenced in the list other times to collapse unused space.... (sometime in 2010-2011):

As a last resort this week, I'm going to get 500+GB of extra file store
added, add a tablespace and move pg_largeobjects to this area. Then use
CLUSTER to rebuild pg_largeobjects back in the default tablespace. This
should fix things I hope, and if needed I'll use Cluster regularly.

This is intriguing and is worth testing to see if it would work.  Getting the space to do it on production might be tough since all drive slots are full and we're already using 3 TB drives.  But worth trying in qa and if it works, then I can get creative for prod.

 
It's "an" other idea...  I've no idea whether it will work any better than biting the bullet and just running VACUUM FULL.

other bits and pieces...

In 2010, Tom suggested REINDEX then VACUUM on pg_largeobject for an 8.? system.  That piqued my interest because we found with 9.1 that weekly reindexing helped with performance.  However, the person who used it didn't find any performance improvement with his VACUUM.  I think reindexing was added to VACUUM FULL in the 9.0 release (but would have to search the release notes to find it).

I remember reading somewhere during this (but can't find the reference <sigh>) that an interrupted VACUUM FREEZE does capture "some" data, so multiples of those actually incrementally improves the speed of the next - but again I can't find the reference, so I've no idea who, when, version, and whether my memory is faulty or misapplied.

I've seen this as well, and Jeff alluded to it in his post.  Subsequent vacuums will run faster after a partial vacuum since some of the data has already been processed.
 

There are miscellaneous improvements to the actual running of VACUUM FULL (and, more often, autovacuum) suggested through tweaking the vacuum parameters, "vacuum_cost_delay" being a high-priority target.  Jeff's questions all point at identifying any limitations that are costing you time due to configuration.

Definitely be tweaking the parameters to get the best performance if we have to go the vacuum route.
 

Totally not part of this specific problem... You have run or know of vacuumlo for deleting orphaned LOs...?  Might be worth running it before you collect your free space.  [just in case you didn't or hadn't... twice the bang, half the pain - but only if you do it before collecting your free space]

I don't think anyone has tried to vacuumlo, but another one I can test in qa.  If it improves vacuum full performance it might be worth doing it in two phases.

Thanks to both of you for your input!  (I forgot to thank Jeff in my reply to him.)

--Dave




Not sure if it applies in your case (large objects stored as bytea vs. pg's built-in large objects). However, maybe you can partition the table as follows (a rough SQL sketch follows the list):

1) create a new master table
2) create the needed partitions
3) begin migrating in the background; as you migrate each partition via a SELECT * FROM the current table, you will not be copying any dead rows
4) once all existing data is copied, perform the following:
   a) declare outage
   b) copy remaining data
   c) rename or drop the old table
   d) rename the new master table to the old table name
   e) end outage
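
A rough 9.1-style (inheritance) sketch, with a made-up "blobs(id bigint, created date, payload bytea)" table standing in for wherever the data actually lives:

-- new master plus one child per range; 9.1 partitioning is
-- inheritance + CHECK constraints
CREATE TABLE blobs_new (LIKE blobs INCLUDING DEFAULTS INCLUDING INDEXES);

CREATE TABLE blobs_2015 (
    CHECK (created >= DATE '2015-01-01' AND created < DATE '2016-01-01')
) INHERITS (blobs_new);
-- ...one child table per range (children need their own indexes; omitted here)...

-- background copy, a partition's worth at a time; only live rows get copied
INSERT INTO blobs_2015
SELECT * FROM blobs
 WHERE created >= DATE '2015-01-01' AND created < DATE '2016-01-01';

-- during the outage: copy any stragglers, then swap the names
BEGIN;
ALTER TABLE blobs RENAME TO blobs_old;
ALTER TABLE blobs_new RENAME TO blobs;
COMMIT;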


Re: XID wraparound with huge pg_largeobject

From
Jeff Janes
Date:
On Wed, Dec 2, 2015 at 8:25 AM, David Kensiski <David@kensiski.org> wrote:
>
>
> On Tue, Dec 1, 2015 at 9:12 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>
>> On Mon, Nov 30, 2015 at 9:58 AM, David Kensiski <David@kensiski.org>
>> wrote:
>> > I am working with a client who has a 9.1 database rapidly approaching
>> > XID
>> > wraparound.
>>
>> The hard limit at 2 billion, or the soft limit at
>> autovacuum_freeze_max_age?
>
>
> We're approaching the hard limit -- we are about to break 1.5 billion.
>
>
>> > They also have an exceedingly large pg_largeobject table (4217
>> > GB) that has never been vacuumed.  An attempt to vacuum this on a
>> > replica
>> > has run for days and never succeeded.
>>
>> What was slowing it down? Reading? Writing? CPU? fdatasync? Locks?
>> Was it run with throttling (e.g. nonzero vacuum_cost_delay) or
>> without?
>
>
> I just talked to my colleague who had tested it and it just stopped doing
> anything.  No cpu, no disk i/o, no apparent activity.  No bueno.

If you can get it going again, I'd either strace it or attach gdb to
it to get a backtrace, if you have those tools, to see what is going
on.
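
From SQL, something like this (9.1 column names) should at least give
you the backend's pid to attach to, and whether it is stuck waiting on
a heavyweight lock:

-- 9.1 uses procpid / current_query / waiting
SELECT procpid, waiting, xact_start, current_query
  FROM pg_stat_activity
 WHERE current_query ILIKE '%vacuum%';

-- anything it is blocked on at the lock level shows up here
SELECT locktype, relation::regclass, mode, granted, pid
  FROM pg_locks
 WHERE NOT granted;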

Is there a lot of free space in the pg_largeobject table (i.e. did
you recently run vacuumlo)?  I wonder if it wasn't doing a very slow
backwards scan over the table in order to truncate away unused space.
The problem is that the backwards scan might not trigger the kernel's
read-ahead code, so every page is read in as random IO rather than
sequential IO.  This can look a lot like doing nothing, depending on
what monitoring tools you use.  The disk is always busy, but just in a
horribly inefficient way.

Assuming you are on minor release 9.1.10 or later, if this is the case
you should be able to just have another session do a `lock TABLE
pg_largeobject in access share mode;` and hold the lock for a while.
This will cause the vacuum to abandon the truncation scan and finish
up the accounting for the freezing.  You can then worry about
finishing up the truncation once the wrap-around danger is over.
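
Concretely, the lock-holding session would be roughly this (sketch):

BEGIN;
-- queueing up behind the vacuum's lock is what makes it back off the
-- truncation scan (9.1.10+); once granted, hold the lock for a while
LOCK TABLE pg_largeobject IN ACCESS SHARE MODE;
SELECT pg_sleep(60);
COMMIT;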


>>
>>
>> What is the throughput available on your RAID?
>
>
> It's 6 drives in a RAID 10 configuration, so striped across three Seagate
> Barracuda drives.  Theoretically we should be able to get as much as 18
> Gb/s, actual mileage may vary.

Does it have a non-volatile write buffer to absorb fsync requests
without having to wait for them to actually reach disk?  If not, you
could have a problem with the small ring buffer that vacuum uses.  In
order to read in a new page, it first needs to kick out an existing
one.  But for a freeze operation where every block needs freezing, the
existing page is almost certainly dirty, so it needs to write it out.
To write it, it needs to write and fsync the WAL record which covers
the dirtying of that page.  Since vacuum uses a small ring buffer,
this happens very often and can really slow things down.

If it used a larger ring of buffers, this would be less of a problem
because the buffers have a longer time to cool off before being
reused, so their WAL is probably already on disk by that time.
Unfortunately there is no way to increase the ring buffer size without
compiling your own postgres.

Cheers,

Jeff


Re: XID wraparound with huge pg_largeobject

From
Jim Nasby
Date:
On 12/2/15 11:18 AM, Jeff Janes wrote:
> Is there a lot of free space in the pg_largeobject table (i.e. did
> you recently run vacuumlo)?  I wonder if it wasn't doing a very slow
> backwards scan over the table in order to truncate away unused space.
> The problem is that the backwards scan might not trigger the kernel's
> read-ahead code, so every page is read in as random IO rather than
> sequential IO.  This can look a lot like doing nothing, depending on
> what monitoring tools you use.  The disk is always busy, but just in a
> horribly inefficient way.
>
> Assuming you are on minor release 9.1.10 or later, if this is the case
> you should be able to just have another session do a `lock TABLE
> pg_largeobject in access share mode;` and hold the lock for a while.
> This will cause the vacuum to abandon the truncation scan and finish
> up the accounting for the freezing.  You can then worry about
> finishing up the truncation once the wrap-around danger is over.

Another option would be to manually read in the heap files at the OS
level so they're at least in the kernel cache. Looking at the "max()" of
the ctid column in the table would give you an idea of what's going on
here. (I put max() in quotes because there aren't real operators on
ctid. I suspect the easiest way to simulate this would be to create a
compound type (block int, lp smallint), cast ctid to that and then do
max(block*2^16+lp).)
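
(A hackier route to the same number, without creating a type, is to go
through text and point -- just a sketch, and it does scan the whole
table:)

-- ctid prints as '(block,lp)', which is also valid point syntax, so
-- the block number is the point's first coordinate
SELECT max((ctid::text::point)[0])::bigint AS highest_used_block
  FROM pg_largeobject;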

I would *definitely* at least start a vacuum freeze on the table. You
really don't want to get backed into a corner on this.

Something else to keep in mind is that a full-scan vacuum (which is what
this needs to be) will block on a page if any other process has a pin
on it (and by doing so, block anyone else from getting a pin on that
page until it's done). If you have other processes hitting this table
heavily, that could cause a problem. Unfortunately, I don't think this
blocking would show up in pg_locks, so it's hard to diagnose. (BTW, I
think the logic to optionally skip such pinned pages was only added in
9.2, so on 9.1 *any* vacuum you run could run into this problem.)

You might want to watch the talk I gave at pgCon about the details of
how vacuum works. Note that's all based on 9.4 though, so YMMV.
http://www.pgcon.org/2015/schedule/events/829.en.html
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com