From: Greg Smith
Subject: Vacuum rate limit in KBps
Msg-id: 4F128D2F.4070200@2ndQuadrant.com
List: pgsql-hackers
So far the reaction I've gotten to my recent submission to make
autovacuum log its read/write rates in MB/s has been rather positive.  I've
been surprised at the unprecedented (to me at least) amount of
backporting onto big production systems it's gotten.  There is a whole
lot of pent-up frustration among larger installs over not having good
visibility into how changes to the cost-based vacuum parameters translate
into real-world units.

That got me thinking:  if MB/s is what everyone wants to monitor, can we
provide a UI to set these parameters that way too?  The attached patch
is still a bit rough, but it does that.  The key was recognizing that
the cost delay plus the cost limit can be converted into an upper limit on
cost units per second, presuming the writes themselves take no time.  If
you then also assume the worst case--that every page will end up dirty--and
throw in the block size too, you can compute a maximum rate in MB/s.
That represents the fastest you can possibly write.
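
To make that worst-case math concrete, here's a minimal standalone sketch
of it; the function and parameter names are mine for illustration, not
anything from the patch:

#include <stdio.h>

/*
 * Worst case: every page processed gets dirtied, so each one costs the
 * full dirty page cost.
 */
static double
max_write_rate_kbps(int cost_limit, int delay_ms, int dirty_cost, int block_kb)
{
    double cost_per_sec  = cost_limit * (1000.0 / delay_ms); /* cost budget per second   */
    double pages_per_sec = cost_per_sec / dirty_cost;        /* all pages assumed dirty  */
    return pages_per_sec * block_kb;                         /* kilobytes written/second */
}

int
main(void)
{
    /* current defaults: cost limit 200, 20 ms delay, dirty cost 20, 8 kB blocks */
    printf("%.0f KB/s\n", max_write_rate_kbps(200, 20, 20, 8)); /* prints 4000 */
    return 0;
}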

If you then turn that equation around, making the maximum write rate the
input, for any given cost delay and dirty page cost you can solve for
the cost limit--the parameter in fictitious units everyone hates.  It
works like this, with the computation internals logged every time they
run for now:

#vacuum_cost_rate_limit = 4000      # maximum write rate in kilobytes/second
LOG:  cost limit=200 based on rate limit=4000 KB/s delay=20 dirty cost=20

That's the same cost limit that was there before, except now it's
derived from that maximum write rate figure.  vacuum_cost_limit is gone
as a GUC, replaced with this new vacuum_cost_rate_limit.  Internally, the
cost limit hasn't gone anywhere though.  All of the entry points into
vacuum and autovacuum derive an internal-only VacuumCostLimit as part of
any setup or rebalance operation.  But there's no change to the underlying
cost management code; the cost limit is budgeted and accounted for in
exactly the same way as it always was.
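
To spell out the derivation those entry points would be doing, here's a
rough sketch of the inverse computation; derive_cost_limit() is a made-up
name for illustration, not the actual code in the patch:

#include <stdio.h>

static int
derive_cost_limit(int rate_limit_kbps, int delay_ms, int dirty_cost, int block_kb)
{
    double pages_per_sec = (double) rate_limit_kbps / block_kb; /* all-dirty worst case    */
    double cost_per_sec  = pages_per_sec * dirty_cost;          /* cost units per second   */
    return (int) (cost_per_sec * delay_ms / 1000.0);            /* budget per delay period */
}

int
main(void)
{
    /* reproduces the LOG line above: 4000 KB/s, 20 ms delay, dirty cost 20, 8 kB blocks */
    printf("cost limit=%d\n", derive_cost_limit(4000, 20, 20, 8)); /* prints cost limit=200 */
    return 0;
}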

Why is this set in kilobytes/second rather than using something based on
a memory unit?  That decision was made after noting these values can
also be set in relation options.  Making relation options aware of
memory unit math seemed ambitious relative to its usefulness, and it's
not like KB/s is hard to work with in this context.

OK, I lied; technically this is set in kibibytes per second right now.  I
ran out of energy before confirming that's consistent with all the similar
GUC settings; I'll put on my pedantic hat later to check that.

One nice thing that falls out of this is that the *vacuum_cost_delay
settings essentially turn into a boolean.  If the delay is 0, cost
limits are off; set it to any other value, and the rate you get is
driven almost entirely by vacuum_cost_rate_limit (the "almost" is mainly
because issues like sleep time accuracy are possible).  You can see
that at work in these examples:

LOG:  cost limit=200 based on rate limit=4000 KB/s delay=20 dirty cost=20
LOG:  cost limit=100 based on rate limit=4000 KB/s delay=10 dirty cost=20

LOG:  cost limit=200 based on rate limit=4000 KB/s delay=20 dirty cost=20
LOG:  cost limit=100 based on rate limit=2000 KB/s delay=20 dirty cost=20

Halve the delay to 10, and the cost limit drops in half too to keep the
same I/O rate.  Halve the rate limit instead, and the cost limit halves
with it.  Most sites will never need to change the delay figure from
20ms; they can just focus on tuning the more human-readable rate limit
figure instead.  The main reason I can think of to keep the delay around
as a full integer setting is sites trying to minimize power use; they
might increase it from the usual 20ms.  I'm not as worried about
postgresql.conf settings bloat when it supports a valid edge case, so long
as most sites find the setting unnecessary to tune.  And the autovacuum
side of the cost delay should fall into that category with this change.
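
Feeding those same combinations through the max_write_rate_kbps() sketch
from earlier (again, my names, not the patch's) shows the same picture
from the other direction:

max_write_rate_kbps(200, 20, 20, 8);   /* defaults:              4000 KB/s */
max_write_rate_kbps(100, 10, 20, 8);   /* delay halved to 10 ms: 4000 KB/s */
max_write_rate_kbps(100, 20, 20, 8);   /* rate limit halved:     2000 KB/s */

As long as the delay stays at some reasonable value, the derived cost
limit moves with it, and the rate limit is what actually determines
throughput.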

Here's a full autovacuum log example.  This shows how close to the KBps
rate the server actually got, along with the autovacuum cost balancing
working the same old way (this is after running the boring
autovac-big.sql test case attached here too):

2012-01-15 02:10:51.905 EST: LOG:  cost limit=200 based on rate
limit=4000 KB/s delay=20 dirty cost=20
2012-01-15 02:10:51.906 EST: DEBUG:  autovac_balance_cost(pid=13054
db=16384, rel=16444, cost_rate_limit=4000, cost_limit=200,
cost_limit_base=200, cost_delay=20)

2012-01-15 02:11:05.127 EST: DEBUG:  "t": removed 4999999 row versions
in 22124 pages
2012-01-15 02:11:05.127 EST: DEBUG:  "t": found 4999999 removable,
5000001 nonremovable row versions in 44248 out of 44248 pages
2012-01-15 02:11:05.127 EST: DETAIL:  0 dead row versions cannot be
removed yet.
     There were 0 unused item pointers.
     0 pages are entirely empty.
     CPU 0.27s/0.97u sec elapsed 131.73 sec.
2012-01-15 02:11:05.127 EST: LOG:  automatic vacuum of table
"gsmith.public.t": index scans: 0
     pages: 0 removed, 44248 remain
     tuples: 4999999 removed, 5000001 remain
     buffer usage: 48253 hits, 40296 misses, 43869 dirtied
     avg read rate: 2.390 MiB/s, avg write rate: 2.602 MiB/s
     system usage: CPU 0.27s/0.97u sec elapsed 131.73 sec

I think this new setting suggests the recently added logging is missing
a combined I/O figure, something that measures reads + writes over the
time period.  This is good enough to demonstrate the sort of UI I was
aiming for in action, though.  The administrator says "don't write more
than 4 MiB/s", and when autovacuum kicks in it averages 2.4 MiB/s of reads
plus 2.6 MiB/s of writes.

I see this change as taking something that feels like black-arts tuning
magic now and turning it into a simple interface that's for the most
part intuitive.  None of the flexibility is lost here:  you can still
retune the relative dirty vs. miss vs. hit costs, and you still have the
option of reducing the delay to a small value on a busy server where
small sleep values are possible.  But you don't have to do any of that
just to tweak autovacuum up or down at a gross level; you can just turn
the simple "at most this much write I/O" knob instead.

All implementation notes and concerns from here down.

The original cost limit here defaulted to 200 and allowed a range of 1
to 10000.  The new default of 4000 shows these values need to be 20X as
large, since 200 cost units at the default delay and dirty cost works out
to 4000 KB/s.  The maximum was adjusted to 200000 KBps accordingly.  Look
at that:  the maximum rate you can run cost-delayed vacuum at is 200 MB/s;
there's another good example of something that used to be mysterious to
compute but is obvious now.

I didn't adjust the lower limit to match, so it's actually possible to
set the new code to only operate at 1/200 of the minimum speed you could
set before.  On balance this is surely a reduction in foot-gun potential
though, and I could make the minimum 200 to eliminate it.  It seems a
needless detail to worry about.

This code is new and just complicated enough that there are surely some
edge cases I broke here.  In particular, I haven't put together a good
concurrent autovacuum test yet to really prove that all the balancing
logic still works correctly.  I need to test that with a settings change
in the middle of a long vacuum, too.

There's one serious concern I don't have a quick answer to.  What do we
do with in-place upgrade of relations that specified a custom
vacuum_cost_limit?  I can easily chew on getting the right logic to
convert those to equivalents in the new setting style, but I'm not
prepared to be solely on the hook for all the in-place upgrade work one
might do here.  Would this be easiest to handle as one of those dump/restore
transformations?  My guess is that's more sensible than the alternative
of making an on-read converter that only writes in the new format, then
worrying about upgrading all old pages before moving forward.  While
this could be an interesting small test case for that sort of thing, I'd
rather not be patient #1 for that part of the long-term in-place upgrade
path right now.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

