Transaction id wraparound and autovacuum - Mailing list pgsql-general

From Casey Duncan
Subject Transaction id wraparound and autovacuum
Date
Msg-id E6A0E662-2554-4CB5-A061-B18F1866FEEF@pandora.com
Whole thread Raw
Responses Re: Transaction id wraparound and autovacuum  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
Late last week we experienced stoppage on two of our high volume
production database servers due to imminent transaction id wrap-
around. The first one caught us unawares and caused a bit of a panic
during lunch 8^). The second one was noticed after we scrutinized all
of the servers and noticed one was throwing the warning that less
than 10 million txids remained before shutdown. Unfortunately due to
the high transaction volume on these servers, we didn't have very
much time to react (in this case < 30 minutes once we saw the warning).

Both of the servers are identically configured, and all run the
autovacuum daemon. At first I thought that I was being naive and had
assumed that the autovac daemon took care of the txid maintenance
when it doesn't. However after re-reviewing the docs, it clearly is
supposed to address this, only in our case it clearly failed. I know
the daemon is running, since I see it working as expected fairly
often when monitoring the boxes. I find it frustrating that the
daemon does not log anything to give you a history of its activity,
but based on previous discussions on the lists it seems the
prevailing opinion is that such activity is not interesting enough.
Especially given the above, I disagree since I now have a severe lack
of forensic evidence to figure out if and when the daemon ever tried
a database-wide vacuum and what became of it.

Here is my autovac config (this is pg 8.1.3 btw, and yes we are
planning to upgrade to 8.1.5):

stats_start_collector = on
stats_command_string = on
stats_block_level = on
stats_row_level = on

autovacuum = on         # enable autovacuum subprocess?
autovacuum_naptime = 300        # time between autovacuum runs, in secs
autovacuum_vacuum_threshold = 500   # min # of tuple updates before
                     # vacuum
autovacuum_analyze_threshold = 250  # min # of tuple updates before
                     # analyze
autovacuum_vacuum_scale_factor = 0.2    # fraction of rel size before
                     # vacuum
autovacuum_analyze_scale_factor = 0.1   # fraction of rel size before
                     # analyze
#autovacuum_vacuum_cost_delay = -1  # default vacuum cost delay for
                     # autovac, -1 means use
                     # vacuum_cost_delay
#autovacuum_vacuum_cost_limit = -1  # default vacuum cost limit for
                     # autovac, -1 means use
                     # vacuum_cost_limit

vacuum_cost_delay = 50          # 0-1000 milliseconds
#vacuum_cost_page_hit = 1       # 0-10000 credits
#vacuum_cost_page_miss = 10     # 0-10000 credits
#vacuum_cost_page_dirty = 20    # 0-10000 credits
vacuum_cost_limit = 200         # 0-10000 credits

The cost options have been tuned for minimal impact under heavy load.

So I wonder if anyone has any insights into why the autovac daemon
failed to detect the imminent txid wraparound? Looking at the code, I
suspect it may have detected the issue, but not had enough time to do
a vacuum before it ran aground. OTOH, I do know that it was
monitoring the "postgres" scratch database, which has no data in it
and would therefore vacuum nearly instantaneously. I'm fairly certain
that this database was not vacuumed because it was still the database
that was referenced in the txid warnings after the server stopped.
Once I vacuumed that db manually it referenced a different database
in the warning.

Here are some random thoughts/suggestions:

- The autovac daemon doesn't initiate a database-wide vacuum until
100,000 txns before the txn id warning hits. In my case this gives
only a few minutes under peak load. According to the docs, this
maintenance is recommended every 500M transactions, seems like it
might be beneficial to multiply this by 100, or maybe more.

- My tuning to make vacuum less impacting seems detrimental to
database-wide vacuums. It takes many hours to perform a database-wide
vacuum (much more time than it takes me to use up 10M txids from what
I have seen). Perhaps this warning should also be increased (or
configurable).

On modern hardware it is possible to chew through txids at more than
1000/sec. At that rate, even 10M ids isn't all that many.

Thanks.

-Casey



pgsql-general by date:

Previous
From: Richard Huxton
Date:
Subject: Re: pg_dump
Next
From: Richard Huxton
Date:
Subject: Re: pg_dump