Re: Dumping a database that is not accepting commands? - Mailing list pgsql-admin

From Natalie Wenz
Subject Re: Dumping a database that is not accepting commands?
Date
Msg-id 9EB0F71A-0C2A-4ED1-A700-D9C52CBDAD4E@ebureau.com
In response to Re: Dumping a database that is not accepting commands?  (Kevin Grittner <kgrittn@ymail.com>)
Responses Re: Dumping a database that is not accepting commands?
List pgsql-admin

On Sep 17, 2013, at 3:46 PM, Kevin Grittner <kgrittn@ymail.com> wrote:

Natalie Wenz <nataliewenz@ebureau.com> wrote:

Sorry; my description of what is going on was a little unclear.
We didn't upgrade the existing database. We moved it to different
hardware, and just created a brand new database to accept the
data that had been backing up in sqlite files while our original
database was offline. I'm still dealing with the wraparound on
the original, just on a different machine.

OK, to restate to be sure I understand, the original database is
being vacuumed in a new location, and a new 9.3 database in the
original location has absorbed the queued data?  Once you complete
the vacuum, you will copy the old data back to the new database at
the old location?

Yes, exactly.


autovacuum_freeze_max_age       | 800000000

Normally the autovacuum seems to keep up sufficiently. We got into
trouble with a bad combination of not-typical long-running
queries, a disk failure and subsequent zpool repair, and the only
person who checks the log files regularly (me) was out of the
office. This has been so painful and slow to recover from, I
don't think we'll ever get into this mess again. (At least not
quite like this. I seem to have a knack for finding *new* ways to
break things.)

A perfect storm of events, eh?  It's hard to have things always go
smoothly in the face of such events, but I see a couple things you
might want to consider.  Increasing autovacuum_freeze_max_age
reduces the amount of time you have to get back on track.  You
might want to take that back down to the default.  There was a bug
causing wraparound prevention autovacuums to trigger too
frequently, which is now fixed in the latest minor releases, so
making that change might not be as painful as you expect.  Make
sure you are monitoring for long-running transactions, so you don't
get burned by one that is accidental.  Also, if you have a
processing cycle where there are off-peak hours on a daily or
weekly basis, you might want to run a VACUUM ANALYZE command
during those windows, to get some of the freezing done before it
is critical.
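For the monitoring piece, one way to do that check might be a query
against pg_stat_activity along these lines (a rough sketch; the
one-hour cutoff is arbitrary, not something from this thread):

    SELECT pid, usename, state, xact_start,
           now() - xact_start AS xact_age
      FROM pg_stat_activity
     WHERE xact_start IS NOT NULL
       AND now() - xact_start > interval '1 hour'
     ORDER BY xact_start;

Anything that shows up there unexpectedly is a candidate for holding
back cleanup.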

We talked a little bit about lowering autovacuum_freeze_max_age, at least somewhat, but there was concern that it would end up doing a lot more lengthy full-table scans. Is that a legitimate concern? Would it be prudent to change any of the other values back to their defaults at the same time? For example, we have autovacuum_vacuum_cost_delay set to 0, on the theory that we don't mind taking a performance hit while autovacuum is running; our priority is that it finish as quickly as possible. If we start vacuuming earlier, though, maybe that should be bumped back up too?
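(For concreteness, the combination being debated would look something
like this in postgresql.conf; the first value is just the stock
default, not a recommendation anyone has made yet:)

    # stock default is 200 million; we currently run 800 million
    autovacuum_freeze_max_age = 200000000
    # no throttling, so autovacuum finishes as fast as possible
    autovacuum_vacuum_cost_delay = 0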

Does autovacuum do different work when it is vacuuming to prevent wraparound (which is triggered when a table's age passes autovacuum_freeze_max_age, right?) than when a vacuum is triggered by a certain amount of change in the table, or invoked manually?
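(If I understand the mechanics correctly, something like this should
show how close each table is to that trigger; just a sketch I haven't
run against our cluster:)

    SELECT c.oid::regclass AS table_name,
           age(c.relfrozenxid) AS xid_age,
           current_setting('autovacuum_freeze_max_age')::int AS freeze_max_age
      FROM pg_class c
     WHERE c.relkind = 'r'
     ORDER BY age(c.relfrozenxid) DESC
     LIMIT 20;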

(Are there any books, or articles, that cover "Vacuuming and Autovacuuming: the gory details"?)

Rereading the Routine Vacuuming page in the docs, this sentence caught my eye:

"However, for static tables (including tables that receive inserts, but no updates or deletes), there is no need to vacuum for space reclamation, so it can be useful to try to maximize the interval between forced autovacuums on very large static tables. Obviously one can do this either by increasing autovacuum_freeze_max_age or decreasing vacuum_freeze_min_age."

We generally never delete from this database at all. This case was unusual; I was migrating the data from one table to another because we added some columns and changed the datatype of many of the columns from text to more appropriate types (timestamp, int, uuid, inet, etc.). Ideally, we would have preferred not to delete anything until the whole table was migrated, but disk space became an issue. Bleh.

With that in mind, would you still recommend putting autovacuum_freeze_max_age back to 200 million?
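(I gather the setting can also be overridden per table rather than
cluster-wide, which might be a middle ground for the big static
tables; my understanding of the syntax, with a made-up table name:)

    ALTER TABLE big_static_table
      SET (autovacuum_freeze_max_age = 800000000);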

I will also take this opportunity to mention again that if anyone
is considering making a patch for 64-bit xids, you would make
at least one small group of people very, very happy.  :)

While 64-bit xids isn't likely, there is some work taking a more
creative approach to the issue, which might make you even happier in
a year or so.  :-)


Oh, the suspense! I will be on the lookout for that!


Side question: some of the longer queries that were running when
the database yakked were deletes of old data that had been
manually migrated to a new table with a more appropriate format.
We were running out of disk space, so we were trying to clear up
some space by removing data we now had in two places. Does a
delete of previously-frozen rows unfreeze them, or anything like
that? Because in a series of maybe a dozen queries or so, we
deleted billions of rows. Does that generate a significant amount
of extra work for the autovacuumer?

The pages which had tuples deleted would need to be cleaned up by
vacuum, and rewritten.  It would also remove all index entries for
all deleted rows.  It might also scan backward from the end of the
table to release space to the OS.  That could conceivably be enough
avoidable work to make your idea of copying out the remaining data
feasible.  What percentage of the rows were deleted?  Could your
copy out be to a separate set of drives?

I just did a little estimating, and it looks like around 48% of the data from one of the 14TB tables was deleted, about 45 billion rows. Sounds like I gave the vacuum a major amount of work to do. Yikes. I know the server I'm working on has over 300TB free, but I'm not certain how it's set up; I'll check whether it's possible to copy out to a separate set of drives.
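(If the spare storage does turn out to be usable, I assume the
copy-out would be something along these lines, with hypothetical
paths and table names; a server-side COPY needs superuser, which
isn't a problem here:)

    COPY big_table TO '/mnt/spare/big_table_remaining.dat';
    -- and later, back into a freshly created table:
    COPY big_table_new FROM '/mnt/spare/big_table_remaining.dat';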

If you know at the time a row is added what group it will be in for
deletion, it might pay to move to partitioning, so that a group of
rows could be deleted pretty much as fast as you can drop a table.

Partitioning has come up occasionally in the past, as this database can be so cumbersome to work with. I'll look into that some more, and try to pitch it to the Official Deciders.
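(On 9.3 that would presumably be the inheritance-based approach; my
rough understanding of the shape of it, with made-up names and a
monthly split:)

    CREATE TABLE events (ts timestamptz NOT NULL, payload text);

    CREATE TABLE events_2013_09 (
        CHECK (ts >= '2013-09-01' AND ts < '2013-10-01')
    ) INHERITS (events);

    -- retiring a month later becomes a quick metadata operation
    -- instead of a multi-billion-row DELETE:
    DROP TABLE events_2013_09;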


--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


--
Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin
