Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower. - Mailing list pgsql-bugs

From David Gould
Subject Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
Date
Msg-id 20151031013714.7e29af3b@engels
In response to Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.  (David Gould <daveg@sonic.net>)
Responses Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.  (David Gould <daveg@sonic.net>)
List pgsql-bugs
On Fri, 30 Oct 2015 23:19:52 -0700
David Gould <daveg@sonic.net> wrote:

> On Fri, 30 Oct 2015 21:49:00 -0700
> Jeff Janes <jeff.janes@gmail.com> wrote:

> > The attached patch does that.  In a system with 4 CPUs and that had
> > 100,000 tables, with a big chunk of them in need of vacuuming, and
> > with 30 worker processes, this increased the throughput by a factor of
> > 40.  Presumably it will do even better with more CPUs.
> >
> > It is still horribly inefficient, but 40 times less so.
>
> That is a good result for such a small change.
>
> The attached patch against REL9_5_STABLE goes a little further. It
> claims the table under the lock, but also addresses the problem of all the
> workers racing to redo the same table by enforcing an ordering on all the
> workers. No worker can claim a table with an oid smaller than the highest
> oid claimed by any worker. That is, instead of racing to the same table,
> workers leapfrog over each other.
>
> In theory the recheck of the stats could be eliminated although this patch
> does not do that. It does eliminate the special handling of stats snapshots
> for autovacuum workers which cuts back on the excess rewriting of the stats
> file somewhat.
>
> I'll send numbers shortly, but as I recall it is over 100 times better than
> the original.

As promised, here are the numbers. The setup is a 2-core Haswell i3 with a
single SSD. The system is fanless, so it slows down after a few minutes of
load. The database has 40,000 tiny, freshly created tables. Autovacuum will
try to analyze them, but that is not much work per table, so the number of
tables analyzed per minute is a pretty good measure of the recheck
overhead and contention among the workers.

Unpatched PostgreSQL 9.5beta1 (I let it run for over an hour, but it did not
get very far):

seconds  elapsed  actions   chunk   sec/av   av/min
  430.1    430.1     1000    1000    0.430    139.5
 1181.2    751.1     2000    1000    0.751     79.9
 1954.0    772.7     3000    1000    0.773     77.6
 2618.5    664.5     4000    1000    0.664     90.3
 3305.7    687.2     5000    1000    0.687     87.3
 4010.1    704.4     6000    1000    0.704     85.2


A ps sample from partway through the run. Most of the CPU time is used by
the stats collector:
$ ps xww | awk '/collector|autovacuum worker/ && !/awk/'
30212 ?        Ss     0:00 postgres: autovacuum launcher process
30213 ?        Ds     0:55 postgres: stats collector process
30221 ?        Ss     0:23 postgres: autovacuum worker process   avac
30231 ?        Ss     0:12 postgres: autovacuum worker process   avac
30243 ?        Ss     0:11 postgres: autovacuum worker process   avac
30257 ?        Ss     0:10 postgres: autovacuum worker process   avac



PostgreSQL 9.5beta1 plus my ordered-oid/high-watermark autovacuum patch:

seconds  elapsed  actions   chunk   sec/av   av/min
   13.4     13.4     1000    1000    0.013   4471.9
   22.9      9.5     2000    1000    0.010   6299.9
   31.9      8.9     3000    1000    0.009   6718.9
   40.2      8.3     4000    1000    0.008   7220.2
   52.2     12.1     5000    1000    0.012   4973.1
   59.5      7.2     6000    1000    0.007   8318.3
   69.4     10.0     7000    1000    0.010   6024.7
   78.9      9.5     8000    1000    0.010   6311.8
   93.5     14.6     9000    1000    0.015   4105.1
  104.3     10.7    10000    1000    0.011   5601.7
  114.4     10.2    11000    1000    0.010   5887.0
  127.5     13.1    12000    1000    0.013   4580.9
  140.1     12.6    13000    1000    0.013   4763.0
  153.8     13.7    14000    1000    0.014   4388.9
  166.7     12.9    15000    1000    0.013   4638.6
  181.6     14.8    16000    1000    0.015   4043.9
  200.9     19.3    17000    1000    0.019   3113.5
  217.5     16.7    18000    1000    0.017   3600.8
  231.5     14.0    19000    1000    0.014   4285.7
  245.5     14.0    20000    1000    0.014   4286.3
  259.0     13.5    21000    1000    0.013   4449.7
  274.5     15.5    22000    1000    0.015   3874.2
  292.5     18.0    23000    1000    0.018   3332.4
  311.3     18.8    24000    1000    0.019   3190.3
  326.1     14.8    25000    1000    0.015   4047.8
  345.1     19.0    26000    1000    0.019   3158.1
  363.5     18.3    27000    1000    0.018   3270.6
  382.4     18.9    28000    1000    0.019   3167.6
  403.4     21.0    29000    1000    0.021   2855.0
  419.6     16.2    30000    1000    0.016   3701.6
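
The derived columns in both tables follow from the elapsed time per chunk:
sec/av is elapsed / chunk and av/min is 60 * chunk / elapsed. The short
Python sketch below recomputes them and compares the time each run needed to
reach 6000 actions. The function names rates and speedup are illustrative
only. Values recomputed from the displayed elapsed times can differ slightly
from the printed av/min, presumably because the printed figures were derived
from unrounded timings.

```python
def rates(elapsed_seconds, chunk=1000):
    """Return (seconds per action, actions per minute) for one chunk."""
    sec_per_action = elapsed_seconds / chunk
    actions_per_minute = 60.0 * chunk / elapsed_seconds
    return sec_per_action, actions_per_minute


def speedup(unpatched_seconds, patched_seconds):
    """Ratio of wall-clock time needed to reach the same action count."""
    return unpatched_seconds / patched_seconds


# First chunk of the unpatched run: 430.1 s for 1000 actions.
sec_per_av, av_per_min = rates(430.1)
print(f"{sec_per_av:.3f} sec/av, {av_per_min:.1f} av/min")

# Time to reach 6000 actions: 4010.1 s unpatched versus 59.5 s patched.
print(f"speedup at 6000 actions: {speedup(4010.1, 59.5):.1f}x")
```

On these figures the patched run reaches 6000 actions roughly 67 times
faster than the unpatched one.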

A ps sample from partway through the run. Most of the CPU time is used by
the workers, not the collector:
$ ps xww | awk '/collector|autovacuum worker/ && !/awk/'
  872 ?        Ds     0:49 postgres: stats collector process
  882 ?        Ds     3:42 postgres: autovacuum worker process   avac
  953 ?        Ds     3:21 postgres: autovacuum worker process   avac
 1062 ?        Ds     2:56 postgres: autovacuum worker process   avac
 1090 ?        Ds     2:34 postgres: autovacuum worker process   avac

It seems to slow down a bit after a few minutes. I think this may be
because the OS page cache fills with dirty pages, as the test is fully I/O
bound for most of its duration. Or possibly CPU throttling. I'll see
about retesting on better hardware.

-dg

--
David Gould              510 282 0869         daveg@sonic.net
If simplicity worked, the world would be overrun with insects.
