Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower. - Mailing list pgsql-bugs
| From | David Gould | 
|---|---|
| Subject | Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower. | 
| Date | |
| Msg-id | 20151031013714.7e29af3b@engels |
| In response to | Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower. (David Gould <daveg@sonic.net>) | 
| Responses | Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower. |
| List | pgsql-bugs | 
On Fri, 30 Oct 2015 23:19:52 -0700
David Gould <daveg@sonic.net> wrote:

> On Fri, 30 Oct 2015 21:49:00 -0700
> Jeff Janes <jeff.janes@gmail.com> wrote:
>
> > The attached patch does that. In a system with 4 CPUs and that had
> > 100,000 tables, with a big chunk of them in need of vacuuming, and
> > with 30 worker processes, this increased the throughput by a factor of
> > 40. Presumably it will do even better with more CPUs.
> >
> > It is still horribly inefficient, but 40 times less so.
>
> That is a good result for such a small change.
>
> The attached patch against REL9_5_STABLE goes a little further. It
> claims the table under the lock, but also addresses the problem of all the
> workers racing to redo the same table by enforcing an ordering on all the
> workers. No worker can claim a table with an oid smaller than the highest
> oid claimed by any worker. That is, instead of racing to the same table,
> workers leapfrog over each other.
>
> In theory the recheck of the stats could be eliminated although this patch
> does not do that. It does eliminate the special handling of stats snapshots
> for autovacuum workers which cuts back on the excess rewriting of the stats
> file somewhat.
>
> I'll send numbers shortly, but as I recall it is over 100 times better than
> the original.

As promised, here are numbers. The setup is a 2-core Haswell i3 with a single
SSD. The system is fanless, so it slows down after a few minutes of load.

The database has 40,000 tiny tables, freshly created. Autovacuum will try to
analyze them, but that is not much work per table, so the number of tables
analyzed per minute is a pretty good measure of the recheck overhead and
contention among the workers.
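
The ordering rule from the quoted message (no worker may claim a table with an oid smaller than the highest oid already claimed by any worker) can be illustrated with a small sketch. This is not the patch itself: it is a Python toy, assuming a hypothetical `HighWatermark` object guarded by a lock, with threads standing in for autovacuum workers and `0` used to mean "nothing claimed yet". The names `HighWatermark`, `claim`, and `run_workers` are invented for illustration.

```python
import bisect
import threading


class HighWatermark:
    """Shared state: the highest table oid claimed so far by any worker."""

    def __init__(self):
        self._lock = threading.Lock()
        self._highest = 0  # assumption: 0 stands for "nothing claimed yet"

    def claim(self, sorted_oids):
        """Claim the first oid above the watermark, or return None if none is left."""
        with self._lock:
            # Skip every oid at or below the watermark in one step.
            i = bisect.bisect_right(sorted_oids, self._highest)
            if i == len(sorted_oids):
                return None
            self._highest = sorted_oids[i]
            return self._highest


def run_workers(table_oids, n_workers):
    """Run n_workers threads over the same tables; return {worker: [oids claimed]}."""
    shared = HighWatermark()
    claimed = {w: [] for w in range(n_workers)}

    def worker(w):
        oids = sorted(table_oids)  # each worker builds its own ordered list
        while True:
            oid = shared.claim(oids)
            if oid is None:
                break
            claimed[w].append(oid)  # stand-in for the actual analyze/vacuum

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return claimed


if __name__ == "__main__":
    for w, oids in run_workers(range(16384, 16384 + 40), n_workers=4).items():
        print(w, oids)
```

Because the watermark only moves upward, each table is claimed by at most one worker per pass, and a worker that loses a race simply moves on to the next higher oid instead of redoing the same table. The sketch leaves out everything else the workers do, including the stats recheck mentioned above.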

Unpatched postgresql 9.5beta1 (I let it run for over an hour but it did not
get very far):

| seconds | elapsed | actions | chunk | sec/av | av/min |
|---|---|---|---|---|---|
| 430.1 | 430.1 | 1000 | 1000 | 0.430 | 139.5 |
| 1181.2 | 751.1 | 2000 | 1000 | 0.751 | 79.9 |
| 1954.0 | 772.7 | 3000 | 1000 | 0.773 | 77.6 |
| 2618.5 | 664.5 | 4000 | 1000 | 0.664 | 90.3 |
| 3305.7 | 687.2 | 5000 | 1000 | 0.687 | 87.3 |
| 4010.1 | 704.4 | 6000 | 1000 | 0.704 | 85.2 |

A ps sample from partway through the run. Most of the cpu used is by the
stats collector:

    $ ps xww | awk '/collector|autovacuum worker/ && !/awk/'
    30212 ?  Ss  0:00 postgres: autovacuum launcher process
    30213 ?  Ds  0:55 postgres: stats collector process
    30221 ?  Ss  0:23 postgres: autovacuum worker process avac
    30231 ?  Ss  0:12 postgres: autovacuum worker process avac
    30243 ?  Ss  0:11 postgres: autovacuum worker process avac
    30257 ?  Ss  0:10 postgres: autovacuum worker process avac

postgresql 9.5beta1 plus my ordered oids/high watermark autovacuum patch:

| seconds | elapsed | actions | chunk | sec/av | av/min |
|---|---|---|---|---|---|
| 13.4 | 13.4 | 1000 | 1000 | 0.013 | 4471.9 |
| 22.9 | 9.5 | 2000 | 1000 | 0.010 | 6299.9 |
| 31.9 | 8.9 | 3000 | 1000 | 0.009 | 6718.9 |
| 40.2 | 8.3 | 4000 | 1000 | 0.008 | 7220.2 |
| 52.2 | 12.1 | 5000 | 1000 | 0.012 | 4973.1 |
| 59.5 | 7.2 | 6000 | 1000 | 0.007 | 8318.3 |
| 69.4 | 10.0 | 7000 | 1000 | 0.010 | 6024.7 |
| 78.9 | 9.5 | 8000 | 1000 | 0.010 | 6311.8 |
| 93.5 | 14.6 | 9000 | 1000 | 0.015 | 4105.1 |
| 104.3 | 10.7 | 10000 | 1000 | 0.011 | 5601.7 |
| 114.4 | 10.2 | 11000 | 1000 | 0.010 | 5887.0 |
| 127.5 | 13.1 | 12000 | 1000 | 0.013 | 4580.9 |
| 140.1 | 12.6 | 13000 | 1000 | 0.013 | 4763.0 |
| 153.8 | 13.7 | 14000 | 1000 | 0.014 | 4388.9 |
| 166.7 | 12.9 | 15000 | 1000 | 0.013 | 4638.6 |
| 181.6 | 14.8 | 16000 | 1000 | 0.015 | 4043.9 |
| 200.9 | 19.3 | 17000 | 1000 | 0.019 | 3113.5 |
| 217.5 | 16.7 | 18000 | 1000 | 0.017 | 3600.8 |
| 231.5 | 14.0 | 19000 | 1000 | 0.014 | 4285.7 |
| 245.5 | 14.0 | 20000 | 1000 | 0.014 | 4286.3 |
| 259.0 | 13.5 | 21000 | 1000 | 0.013 | 4449.7 |
| 274.5 | 15.5 | 22000 | 1000 | 0.015 | 3874.2 |
| 292.5 | 18.0 | 23000 | 1000 | 0.018 | 3332.4 |
| 311.3 | 18.8 | 24000 | 1000 | 0.019 | 3190.3 |
| 326.1 | 14.8 | 25000 | 1000 | 0.015 | 4047.8 |
| 345.1 | 19.0 | 26000 | 1000 | 0.019 | 3158.1 |
| 363.5 | 18.3 | 27000 | 1000 | 0.018 | 3270.6 |
| 382.4 | 18.9 | 28000 | 1000 | 0.019 | 3167.6 |
| 403.4 | 21.0 | 29000 | 1000 | 0.021 | 2855.0 |
| 419.6 | 16.2 | 30000 | 1000 | 0.016 | 3701.6 |

A ps sample from partway through the run. Most of the cpu used is by
workers, not the collector.

    $ ps xww | awk '/collector|autovacuum worker/ && !/awk/'
    872 ?   Ds  0:49 postgres: stats collector process
    882 ?   Ds  3:42 postgres: autovacuum worker process avac
    953 ?   Ds  3:21 postgres: autovacuum worker process avac
    1062 ?  Ds  2:56 postgres: autovacuum worker process avac
    1090 ?  Ds  2:34 postgres: autovacuum worker process avac

It seems to slow down a bit after a few minutes. I think this may be because
of filling the OS page cache with dirty pages as it is fully IO bound for
most of the test duration. Or possibly cpu throttling. I'll see about
retesting on better hardware.

-dg

--
David Gould 510 282 0869 daveg@sonic.net
If simplicity worked, the world would be overrun with insects.