Thread: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From
daveg@sonic.net
Date:
The following bug has been logged on the website:

Bug reference: 13750
Logged by: David Gould
Email address: daveg@sonic.net
PostgreSQL version: 9.4.5
Operating system: Linux

Description:

With more than a few tens of thousands of tables in one database, autovacuuming slows down radically and becomes ineffective. Increasing the number of autovacuum workers makes the slowdown worse.

A client has an application that loads data from thousands of external feeds many times a day. They create a new table for each batch of new data, and after some months old tables are dropped. Typically the database has about 200,000 fairly small tables, each of which has a few indexes and a toast table. There are also a lot of temp tables that come and go. pg_class has over half a million rows. The hosts have 80 hardware threads, 1TB of memory and fusionIO storage.

They started seeing long-running autovacuum workers doing anti-wraparound vacuums. pg_stat_activity showed workers had been vacuuming a single small table (e.g. 10k rows) for several hours, yet in the query log the actual vacuum took less than a second.

When they updated to 9.4.4 they also increased the number of autovacuum workers in anticipation of the multixact fix causing extra vacuuming. Several days later they observed massive catalog bloat; for example, pg_attribute was over 200GB of mostly empty pages. This caused new connections to get stuck in startup as the catalogs no longer fit in the buffer cache.

I then tried experimenting with different settings of autovacuum workers and found:

    Autovacuum actions per hour

    Workers   Actions   per Worker
          1    2110.1       2110.1
          2    1760.8        880.4
          4     647.3        161.8
          8     386.2         48.3
         72      62.0          0.9

I have analyzed this and created reproduction scripts. I'll send those later today.
Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From
Alvaro Herrera
Date:
daveg@sonic.net wrote: > With more than a few tens of thousands of tables in one database > autovacuuming slows down radically and becomes ineffective. Increasing the > number of autovacuum workers makes the slow down worse. Yeah, you need to decrease autovacuum_vacuum_cost_delay if you want to make them go faster. (As more workers are started, the existing ones slow down. The intent is that the I/O bandwidth allocation is kept constant regardless of how many workers there are.) > When they updated to 9.4.4 they also increased the number of autovacuum > workers in anticipation of the multixact fix causing extra vacuuming. > Several days later they observed massive catalog bloat, eg pg_attribute was > over 200GB of mostly empty pages. This caused new connections to get stuck > in startup as the catalogs no longer fit in the buffer cache. Oh crap. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From
David Gould
Date:
On Fri, 30 Oct 2015 10:46:46 -0300 Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> daveg@sonic.net wrote:
>
> > With more than a few tens of thousands of tables in one database autovacuuming slows down radically and becomes ineffective. Increasing the number of autovacuum workers makes the slowdown worse.
>
> Yeah, you need to decrease autovacuum_vacuum_cost_delay if you want to make them go faster. (As more workers are started, the existing ones slow down. The intent is that the I/O bandwidth allocation is kept constant regardless of how many workers there are.)

The cost delays are all 0. We care about bloat, not bandwidth.

Anyway, they are not actually vacuuming. They are waiting on the VacuumScheduleLock and requesting fresh snapshots from the stats_collector. Basically there is a loop in do_autovacuum() that looks like:

    ... build list of all tables to vacuum ...
    for tab in tables_to_vacuum:
        lock(VacuumScheduleLock)
        for worker in autovacuum_workers:
            if worker.working_on == tab:
                skip = true
        if skip or very_expensive_check_to_see_if_already_vacuumed(tab):
            unlock(VacuumScheduleLock)
            continue
        unlock(VacuumScheduleLock)
        actually_vacuum(tab)

Since all the workers are working on the same list they all compete to vacuum the next item on the list.

-dg

-- 
David Gould   daveg@sonic.net
If simplicity worked, the world would be overrun with insects.
Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From
Alvaro Herrera
Date:
David Gould wrote: > The cost delays are all 0. We care about bloat, not bandwidth. > > Anyway, they are not actually vacuuming. They are waiting on the > VacuumScheduleLock. And requesting freshs snapshots from the > stats_collector. Oh, I see. Interesting. Proposals welcome. I especially dislike the ("very_expensive") pgstat check. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > David Gould wrote: >> Anyway, they are not actually vacuuming. They are waiting on the >> VacuumScheduleLock. And requesting freshs snapshots from the >> stats_collector. > Oh, I see. Interesting. Proposals welcome. I especially dislike the > ("very_expensive") pgstat check. Couldn't we simply move that out of the locked stanza? That is, if no other worker is working on the table, claim it, and release the lock immediately. Then do the "very expensive" check. If that fails, we have to re-take the lock to un-claim the table, but that sounds OK. regards, tom lane
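A minimal sketch of the claim-then-check scheme Tom describes, for illustration only. AutovacuumScheduleLock and MyWorkerInfo->wi_tableoid are the names used in autovacuum.c; worker_already_claimed() and recheck_needs_vacuum() are hypothetical helpers standing in for the existing logic, and the fragment is assumed to sit inside the per-table loop of do_autovacuum():

    LWLockAcquire(AutovacuumScheduleLock, LW_EXCLUSIVE);
    if (worker_already_claimed(relid))
    {
        LWLockRelease(AutovacuumScheduleLock);
        continue;                       /* another worker owns this table */
    }
    MyWorkerInfo->wi_tableoid = relid;  /* claim it */
    LWLockRelease(AutovacuumScheduleLock);

    /* The expensive pgstat recheck now runs without holding the lock. */
    if (!recheck_needs_vacuum(relid))
    {
        /* Re-take the lock only to un-claim, then move on. */
        LWLockAcquire(AutovacuumScheduleLock, LW_EXCLUSIVE);
        MyWorkerInfo->wi_tableoid = InvalidOid;
        LWLockRelease(AutovacuumScheduleLock);
        continue;
    }

    /* ... go on to vacuum/analyze relid ... */

The point is that only the cheap claim and un-claim steps serialize on the lock; the pgstat fetch happens concurrently in each worker.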
Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From
Alvaro Herrera
Date:
Tom Lane wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: > > David Gould wrote: > >> Anyway, they are not actually vacuuming. They are waiting on the > >> VacuumScheduleLock. And requesting freshs snapshots from the > >> stats_collector. > > > Oh, I see. Interesting. Proposals welcome. I especially dislike the > > ("very_expensive") pgstat check. > > Couldn't we simply move that out of the locked stanza? That is, if no > other worker is working on the table, claim it, and release the lock > immediately. Then do the "very expensive" check. If that fails, we > have to re-take the lock to un-claim the table, but that sounds OK. Hmm, yeah, that would work. Of course, if we could avoid the pgstat check completely that would be even better. The amount of pgstat traffic that causes is ridiculous. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > Of course, if we could avoid the pgstat check completely that would be > even better. The amount of pgstat traffic that causes is ridiculous. Good point ... shouldn't we have already checked the stats before ever deciding to try to claim the table? regards, tom lane
Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From
Alvaro Herrera
Date:
Tom Lane wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: > > Of course, if we could avoid the pgstat check completely that would be > > even better. The amount of pgstat traffic that causes is ridiculous. > > Good point ... shouldn't we have already checked the stats before ever > deciding to try to claim the table? The second check is there to allow for some other worker (or manual vacuum) having vacuumed it after we first checked, but which had finished before we check the array of current jobs. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From
Jeff Janes
Date:
On Fri, Oct 30, 2015 at 8:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > Of course, if we could avoid the pgstat check completely that would be even better. The amount of pgstat traffic that causes is ridiculous.
>
> Good point ... shouldn't we have already checked the stats before ever deciding to try to claim the table?

It forces another pgstat fetch just before it claims the table, in case the table has just finished being vacuumed.

Cheers, Jeff
Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > Tom Lane wrote: >> Good point ... shouldn't we have already checked the stats before ever >> deciding to try to claim the table? > The second check is there to allow for some other worker (or manual > vacuum) having vacuumed it after we first checked, but which had > finished before we check the array of current jobs. I wonder whether that check costs more than it saves. regards, tom lane
Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From
David Gould
Date:
On Fri, 30 Oct 2015 12:17:11 -0300 Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > David Gould wrote: > > > The cost delays are all 0. We care about bloat, not bandwidth. > > > > Anyway, they are not actually vacuuming. They are waiting on the > > VacuumScheduleLock. And requesting freshs snapshots from the > > stats_collector. > > Oh, I see. Interesting. Proposals welcome. I especially dislike the > ("very_expensive") pgstat check. I have an effective patch for this issue that I think is small enough to be backpatched. I'm cleaning it up a bit and will send it this weekend. -dg -- David Gould 510 282 0869 daveg@sonic.net If simplicity worked, the world would be overrun with insects.
Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From
Jeff Janes
Date:
On Fri, Oct 30, 2015 at 8:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: >> David Gould wrote: >>> Anyway, they are not actually vacuuming. They are waiting on the >>> VacuumScheduleLock. And requesting freshs snapshots from the >>> stats_collector. > >> Oh, I see. Interesting. Proposals welcome. I especially dislike the >> ("very_expensive") pgstat check. > > Couldn't we simply move that out of the locked stanza? That is, if no > other worker is working on the table, claim it, and release the lock > immediately. Then do the "very expensive" check. If that fails, we > have to re-take the lock to un-claim the table, but that sounds OK. The attached patch does that. In a system with 4 CPUs and that had 100,000 tables, with a big chunk of them in need of vacuuming, and with 30 worker processes, this increased the throughput by a factor of 40. Presumably it will do even better with more CPUs. It is still horribly inefficient, but 40 times less so. Cheers, Jeff
Attachment
Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From
Jeff Janes
Date:
On Fri, Oct 30, 2015 at 9:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: >> Tom Lane wrote: >>> Good point ... shouldn't we have already checked the stats before ever >>> deciding to try to claim the table? > >> The second check is there to allow for some other worker (or manual >> vacuum) having vacuumed it after we first checked, but which had >> finished before we check the array of current jobs. > > I wonder whether that check costs more than it saves. A single autovacuum worker can run for hours or days. I don't think we should start vacuuming a TB size table because it needed vacuuming days ago, when the initial to-do list was built up, but no longer does. So some kind of recheck is needed. I thought of making the recheck first use whichever snapshot we currently have hanging around, and then only if it still needs vacuuming force a fresh snapshot and re-re-check. The problem with that is that any previous snapshot is aggressively destroyed at the end of each vacuum or analyze by the EOXact code. So we don't actually have a snapshot hanging around to use until we go to the work of re-parsing the database stats file. So you have to take special steps to exempt AutoVacuumWorker from EOXact code clearing out the stats, and I don't know what the repercussions of that might be. We could also relax the freshness requirements of even the final re-check, perhaps dynamically. No point in re-parsing a 40MB stats file to avoid unnecessary vacuuming a 16KB table. But parsing a 8KB stats files to avoid unnecessary vacuuming of a 1TB table is well worth it. But that runs into the same problem as above. Once you have destroyed your previous stats snapshot, you no longer have the ability to accept stale stats any longer. Cheers, Jeff
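One possible shape for the "relax the freshness requirement dynamically" idea above is a size-dependent staleness budget. This is only a sketch of a hypothetical heuristic, not existing code; the function name and the constants are invented:

    /*
     * Hypothetical heuristic: accept an older stats snapshot for small
     * tables, insist on a fresher one as the table gets larger.
     */
    static bool
    stats_fresh_enough(TimestampTz snapshot_ts, BlockNumber relpages)
    {
        /* roughly 60 s of tolerated staleness for a tiny table, shrinking
         * as the table grows; the scaling factor is arbitrary */
        int     allowed_ms = (int) (60000 / (1 + relpages / 128));

        return !TimestampDifferenceExceeds(snapshot_ts,
                                           GetCurrentTimestamp(),
                                           allowed_ms);
    }

As Jeff notes, something like this only helps if the previous snapshot survives long enough to be consulted, which the current EOXact cleanup prevents.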
Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From
David Gould
Date:
On Fri, 30 Oct 2015 21:49:00 -0700 Jeff Janes <jeff.janes@gmail.com> wrote:

> On Fri, Oct 30, 2015 at 8:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > > David Gould wrote:
> > > > Anyway, they are not actually vacuuming. They are waiting on the VacuumScheduleLock and requesting fresh snapshots from the stats_collector.
> > >
> > > Oh, I see. Interesting. Proposals welcome. I especially dislike the ("very_expensive") pgstat check.
> >
> > Couldn't we simply move that out of the locked stanza? That is, if no other worker is working on the table, claim it, and release the lock immediately. Then do the "very expensive" check. If that fails, we have to re-take the lock to un-claim the table, but that sounds OK.
>
> The attached patch does that. In a system with 4 CPUs and that had 100,000 tables, with a big chunk of them in need of vacuuming, and with 30 worker processes, this increased the throughput by a factor of 40. Presumably it will do even better with more CPUs.
>
> It is still horribly inefficient, but 40 times less so.

That is a good result for such a small change.

The attached patch against REL9_5_STABLE goes a little further. It claims the table under the lock, but also addresses the problem of all the workers racing to redo the same table by enforcing an ordering on all the workers. No worker can claim a table with an oid smaller than the highest oid claimed by any worker. That is, instead of racing to the same table, workers leapfrog over each other.

In theory the recheck of the stats could be eliminated, although this patch does not do that. It does eliminate the special handling of stats snapshots for autovacuum workers, which cuts back on the excess rewriting of the stats file somewhat.

I'll send numbers shortly, but as I recall it is over 100 times better than the original.

-dg

-- 
David Gould   510 282 0869   daveg@sonic.net
If simplicity worked, the world would be overrun with insects.
Attachment
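For readers following along, the leapfrog claim described above might look roughly like this. It is an illustrative sketch, not the attached patch: highest_claimed_oid() is a hypothetical helper standing in for a scan of the shared WorkerInfo array, and table_oids[] is assumed to be the worker's candidate list sorted ascending by OID.

    LWLockAcquire(AutovacuumLock, LW_EXCLUSIVE);
    {
        Oid     highwater = highest_claimed_oid();  /* max wi_tableoid across workers */
        Oid     relid = InvalidOid;

        /* Skip every candidate at or below the shared high-water mark. */
        while (oid_idx < numoids && table_oids[oid_idx] <= highwater)
            oid_idx++;
        if (oid_idx < numoids)
            relid = table_oids[oid_idx++];

        /* Claim it (InvalidOid means this worker has nothing left to do). */
        MyWorkerInfo->wi_tableoid = relid;
    }
    LWLockRelease(AutovacuumLock);

Because every worker takes the list in the same ascending order and never goes below another worker's claim, two workers can no longer race to the same table.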
Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From
David Gould
Date:
On Fri, 30 Oct 2015 12:51:43 -0400 Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > Tom Lane wrote:
> > > Good point ... shouldn't we have already checked the stats before ever deciding to try to claim the table?
> >
> > The second check is there to allow for some other worker (or manual vacuum) having vacuumed it after we first checked, but which had finished before we check the array of current jobs.
>
> I wonder whether that check costs more than it saves.

It does indeed. It drives the stats collector wild. And of course if there are lots of tables and indexes the stats temp file gets very large, so it can take a long time (seconds) to rewrite it. This happens for each worker for each table that is a candidate for vacuuming.

Since it would not be convenient to provide a copy of the client's 8TB database, I have made a standalone reproduction. The attached files:

    build_test_instance.sh - create a test instance
    datagen.py             - used by the above to populate it with lots of tables
    logbyv.awk             - count auto analyze actions in the postgres log
    trace.sh               - strace the stats collector and autovacuum workers
    tracereport.sh         - list the top 50 calls in the strace output

The test process is to run the build_test_instance script to create an instance with one database containing a large number of tiny tables. During the setup autovacuuming is off. Then make a tarball of the instance for reuse. For each test case, untar the instance, set the number of workers and start it. After a short time autovacuum will start workers to analyze the new tables. Expect to see the stats collector doing lots of writing.

You may want to use tmpfs or a ramdisk for the data dir when building the test instance. The configuration is sized for a reasonable desktop: 8 to 16GB of memory and an SSD.

-dg

-- 
David Gould   510 282 0869   daveg@sonic.net
If simplicity worked, the world would be overrun with insects.
Attachment
Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From
David Gould
Date:
On Fri, 30 Oct 2015 22:16:04 -0700 Jeff Janes <jeff.janes@gmail.com> wrote:

> On Fri, Oct 30, 2015 at 9:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > > Tom Lane wrote:
> > > > Good point ... shouldn't we have already checked the stats before ever deciding to try to claim the table?
> > >
> > > The second check is there to allow for some other worker (or manual vacuum) having vacuumed it after we first checked, but which had finished before we check the array of current jobs.
> >
> > I wonder whether that check costs more than it saves.
>
> A single autovacuum worker can run for hours or days. I don't think we should start vacuuming a TB size table because it needed vacuuming days ago, when the initial to-do list was built up, but no longer does. So some kind of recheck is needed.
>
> I thought of making the recheck first use whichever snapshot we currently have hanging around, and then only if it still needs vacuuming force a fresh snapshot and re-re-check. The problem with that is that any previous snapshot is aggressively destroyed at the end of each vacuum or analyze by the EOXact code. So we don't actually have a snapshot hanging around to use until we go to the work of re-parsing the database stats file. So you have to take special steps to exempt AutoVacuumWorker from the EOXact code clearing out the stats, and I don't know what the repercussions of that might be.
>
> We could also relax the freshness requirements of even the final re-check, perhaps dynamically. No point in re-parsing a 40MB stats file to avoid unnecessary vacuuming of a 16KB table. But parsing an 8KB stats file to avoid unnecessary vacuuming of a 1TB table is well worth it. But that runs into the same problem as above. Once you have destroyed your previous stats snapshot, you no longer have the ability to accept stale stats any longer.

I've been thinking about this and frankly the whole scheme is a bit suspect. There are a couple of problems:

1. In an 8TB database with 200,000 tables it is not reasonable to build a list in advance of all the tables to vacuum. Inevitably some workers will be working off very stale lists.

2. Autovacuum workers do not need a statistics snapshot. They only need the current statistics for the table they are considering vacuuming.

I think the following might be too big to back patch, or maybe even for 9.5, but I think a better overall scheme would look something like:

- The stats collector accepts a new request message, "give me the stats for this: (dboid, table oid)", and handles it by replying to the sender (autovacuum worker) with the current statistics entry for that table.

- Instead of the worker building a big list of work up front, it just keeps track of a shared highwater mark of tables worked on and does an index scan for the next higher oid in pg_class. This is similar to my patch in that vacuuming is ordered so that workers never contend, but instead of working from a list they look up the next table in pg_class each time using the oid index. For each table that might need vacuuming, it would send the stats collector a request for that table's statistics and use the reply to determine what action to take.

This avoids the stale work list and the need for rechecking. Also, it eliminates all the flogging of the stats files (I suspect SSD owners and laptop users will rejoice!).

-dg

-- 
David Gould   510 282 0869   daveg@sonic.net
If simplicity worked, the world would be overrun with insects.
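A sketch of how the "next table above the shared high-water mark" lookup could use the pg_class OID index. This only illustrates the idea: it assumes the system index is usable (so rows come back in ascending OID order), reuses the hypothetical highest_claimed_oid() helper, and skips the filtering a real autovacuum pass would need beyond the relkind check.

    Relation    classRel = heap_open(RelationRelationId, AccessShareLock);
    SysScanDesc scan;
    ScanKeyData key;
    HeapTuple   tup;
    Oid         highwater = highest_claimed_oid();   /* hypothetical helper */
    Oid         next_relid = InvalidOid;

    /* pg_class rows with oid > highwater, in ascending OID order via the index */
    ScanKeyInit(&key,
                ObjectIdAttributeNumber,
                BTGreaterStrategyNumber, F_OIDGT,
                ObjectIdGetDatum(highwater));

    scan = systable_beginscan(classRel, ClassOidIndexId, true, NULL, 1, &key);
    while ((tup = systable_getnext(scan)) != NULL)
    {
        Form_pg_class classForm = (Form_pg_class) GETSTRUCT(tup);

        if (classForm->relkind != RELKIND_RELATION)
            continue;                   /* plain tables only, for the sketch */
        next_relid = HeapTupleGetOid(tup);
        break;
    }
    systable_endscan(scan);
    heap_close(classRel, AccessShareLock);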
Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From
David Gould
Date:
On Fri, 30 Oct 2015 23:19:52 -0700 David Gould <daveg@sonic.net> wrote:

> On Fri, 30 Oct 2015 21:49:00 -0700 Jeff Janes <jeff.janes@gmail.com> wrote:
> > The attached patch does that. In a system with 4 CPUs and that had 100,000 tables, with a big chunk of them in need of vacuuming, and with 30 worker processes, this increased the throughput by a factor of 40. Presumably it will do even better with more CPUs.
> >
> > It is still horribly inefficient, but 40 times less so.
>
> That is a good result for such a small change.
>
> The attached patch against REL9_5_STABLE goes a little further. It claims the table under the lock, but also addresses the problem of all the workers racing to redo the same table by enforcing an ordering on all the workers. No worker can claim a table with an oid smaller than the highest oid claimed by any worker. That is, instead of racing to the same table, workers leapfrog over each other.
>
> In theory the recheck of the stats could be eliminated, although this patch does not do that. It does eliminate the special handling of stats snapshots for autovacuum workers, which cuts back on the excess rewriting of the stats file somewhat.
>
> I'll send numbers shortly, but as I recall it is over 100 times better than the original.

As promised, here are numbers. The setup is a 2 core haswell i3 with a single SSD. The system is fanless, so it slows down after a few minutes of load. The database has 40,000 tiny tables, freshly created. Autovacuum will try to analyze them, but that is not much work per table, so the number of tables analyzed per minute is a pretty good measure of the recheck overhead and contention among the workers.

Unpatched postgresql 9.5beta1 (I let it run for over an hour but it did not get very far):

    seconds  elapsed  actions  chunk  sec/av  av/min
      430.1    430.1     1000   1000   0.430   139.5
     1181.2    751.1     2000   1000   0.751    79.9
     1954.0    772.7     3000   1000   0.773    77.6
     2618.5    664.5     4000   1000   0.664    90.3
     3305.7    687.2     5000   1000   0.687    87.3
     4010.1    704.4     6000   1000   0.704    85.2

A ps sample from partway through the run. Most of the cpu used is by the stats collector:

    $ ps xww | awk '/collector|autovacuum worker/ && !/awk/'
    30212 ?  Ss  0:00 postgres: autovacuum launcher process
    30213 ?  Ds  0:55 postgres: stats collector process
    30221 ?  Ss  0:23 postgres: autovacuum worker process avac
    30231 ?  Ss  0:12 postgres: autovacuum worker process avac
    30243 ?  Ss  0:11 postgres: autovacuum worker process avac
    30257 ?  Ss  0:10 postgres: autovacuum worker process avac

postgresql 9.5beta1 plus my ordered oids/high watermark autovacuum patch:

    seconds  elapsed  actions  chunk  sec/av  av/min
       13.4     13.4     1000   1000   0.013  4471.9
       22.9      9.5     2000   1000   0.010  6299.9
       31.9      8.9     3000   1000   0.009  6718.9
       40.2      8.3     4000   1000   0.008  7220.2
       52.2     12.1     5000   1000   0.012  4973.1
       59.5      7.2     6000   1000   0.007  8318.3
       69.4     10.0     7000   1000   0.010  6024.7
       78.9      9.5     8000   1000   0.010  6311.8
       93.5     14.6     9000   1000   0.015  4105.1
      104.3     10.7    10000   1000   0.011  5601.7
      114.4     10.2    11000   1000   0.010  5887.0
      127.5     13.1    12000   1000   0.013  4580.9
      140.1     12.6    13000   1000   0.013  4763.0
      153.8     13.7    14000   1000   0.014  4388.9
      166.7     12.9    15000   1000   0.013  4638.6
      181.6     14.8    16000   1000   0.015  4043.9
      200.9     19.3    17000   1000   0.019  3113.5
      217.5     16.7    18000   1000   0.017  3600.8
      231.5     14.0    19000   1000   0.014  4285.7
      245.5     14.0    20000   1000   0.014  4286.3
      259.0     13.5    21000   1000   0.013  4449.7
      274.5     15.5    22000   1000   0.015  3874.2
      292.5     18.0    23000   1000   0.018  3332.4
      311.3     18.8    24000   1000   0.019  3190.3
      326.1     14.8    25000   1000   0.015  4047.8
      345.1     19.0    26000   1000   0.019  3158.1
      363.5     18.3    27000   1000   0.018  3270.6
      382.4     18.9    28000   1000   0.019  3167.6
      403.4     21.0    29000   1000   0.021  2855.0
      419.6     16.2    30000   1000   0.016  3701.6

A ps sample from partway through the run. Most of the cpu used is by the workers, not the collector.

    $ ps xww | awk '/collector|autovacuum worker/ && !/awk/'
     872 ?  Ds  0:49 postgres: stats collector process
     882 ?  Ds  3:42 postgres: autovacuum worker process avac
     953 ?  Ds  3:21 postgres: autovacuum worker process avac
    1062 ?  Ds  2:56 postgres: autovacuum worker process avac
    1090 ?  Ds  2:34 postgres: autovacuum worker process avac

It seems to slow down a bit after a few minutes. I think this may be because of filling the OS page cache with dirty pages, as it is fully IO bound for most of the test duration. Or possibly cpu throttling. I'll see about retesting on better hardware.

-dg

-- 
David Gould   510 282 0869   daveg@sonic.net
If simplicity worked, the world would be overrun with insects.
Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From
David Gould
Date:
On Sat, 31 Oct 2015 01:37:14 -0700 David Gould <daveg@sonic.net> wrote:

I've added numbers from the same test setup with Jeff Janes' patch.

postgresql-9.5beta1 with Jeff Janes vac_move_lock.patch:

    seconds  elapsed  actions  chunk  sec/av  av/min
      116.1    116.1     1000   1000   0.116   516.8
      224.0    107.9     2000   1000   0.108   556.1
      356.4    132.4     3000   1000   0.132   453.2
      489.8    133.4     4000   1000   0.133   449.8
      602.5    112.7     5000   1000   0.113   532.3
      706.6    104.1     6000   1000   0.104   576.3
      801.9     95.3     7000   1000   0.095   629.5
      886.3     84.4     8000   1000   0.084   710.9
      983.0     96.7     9000   1000   0.097   620.5
     1074.5     91.5    10000   1000   0.091   656.1
     1168.2     93.7    11000   1000   0.094   640.1
     1259.5     91.3    12000   1000   0.091   656.9
     1350.3     90.8    13000   1000   0.091   661.1
     1434.2     83.9    14000   1000   0.084   715.4
     1516.0     81.9    15000   1000   0.082   733.0

ps sample during the run. Note that the stats collector is very busy.

    3006 ?  Ds  2:43 postgres: stats collector process
    3014 ?  Ss  1:42 postgres: autovacuum worker process avac
    3044 ?  Ss  1:29 postgres: autovacuum worker process avac
    3053 ?  Ds  1:27 postgres: autovacuum worker process avac
    3060 ?  Ss  1:25 postgres: autovacuum worker process avac

> Unpatched postgresql 9.5beta1 (I let it run for over an hour but it did not get very far):
>
>     seconds  elapsed  actions  chunk  sec/av  av/min
>       430.1    430.1     1000   1000   0.430   139.5
>      1181.2    751.1     2000   1000   0.751    79.9
>      1954.0    772.7     3000   1000   0.773    77.6
>      2618.5    664.5     4000   1000   0.664    90.3
>      3305.7    687.2     5000   1000   0.687    87.3
>      4010.1    704.4     6000   1000   0.704    85.2
>
> A ps sample from partway through the run. Most of the cpu used is by the stats collector:
>
>     $ ps xww | awk '/collector|autovacuum worker/ && !/awk/'
>     30212 ?  Ss  0:00 postgres: autovacuum launcher process
>     30213 ?  Ds  0:55 postgres: stats collector process
>     30221 ?  Ss  0:23 postgres: autovacuum worker process avac
>     30231 ?  Ss  0:12 postgres: autovacuum worker process avac
>     30243 ?  Ss  0:11 postgres: autovacuum worker process avac
>     30257 ?  Ss  0:10 postgres: autovacuum worker process avac
>
> postgresql 9.5beta1 plus my ordered oids/high watermark autovacuum patch:
>
>     seconds  elapsed  actions  chunk  sec/av  av/min
>        13.4     13.4     1000   1000   0.013  4471.9
>        22.9      9.5     2000   1000   0.010  6299.9
>        31.9      8.9     3000   1000   0.009  6718.9
>        40.2      8.3     4000   1000   0.008  7220.2
>        52.2     12.1     5000   1000   0.012  4973.1
>        59.5      7.2     6000   1000   0.007  8318.3
>        69.4     10.0     7000   1000   0.010  6024.7
>        78.9      9.5     8000   1000   0.010  6311.8
>        93.5     14.6     9000   1000   0.015  4105.1
>       104.3     10.7    10000   1000   0.011  5601.7
>       114.4     10.2    11000   1000   0.010  5887.0
>       127.5     13.1    12000   1000   0.013  4580.9
>       140.1     12.6    13000   1000   0.013  4763.0
>       153.8     13.7    14000   1000   0.014  4388.9
>       166.7     12.9    15000   1000   0.013  4638.6
> ...
>
> A ps sample from partway through the run. Most of the cpu used is by the workers, not the collector.
>
>     $ ps xww | awk '/collector|autovacuum worker/ && !/awk/'
>      872 ?  Ds  0:49 postgres: stats collector process
>      882 ?  Ds  3:42 postgres: autovacuum worker process avac
>      953 ?  Ds  3:21 postgres: autovacuum worker process avac
>     1062 ?  Ds  2:56 postgres: autovacuum worker process avac
>     1090 ?  Ds  2:34 postgres: autovacuum worker process avac

-dg

-- 
David Gould   510 282 0869   daveg@sonic.net
If simplicity worked, the world would be overrun with insects.
Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From
David Gould
Date:
I've collected strace output for a one minute interval with 4 autovacuum workers on my test setup for stock 9.5beta1, Jeff Janes' vac_mov_lock patch, and my ordered oids / high watermark patch. These results are normalized to times and number of calls per table analyzed by autovacuum. To refresh, on this setup:

    Patch Version    TPM
    none               85
    vac_mov_lock      600
    ordered oids     5225

Strace summary:

    postgresql 9.5beta1 cost per table @ 85 tables per minute
    -----------------------------------------------------------
    calls   msec     system call  [ 4 autovacuum workers ]
    ------  -------  -------------------
     19.46   196.09  select(0,           <<< Waiting for stats snapshot
      3.26  1040.46  semop(43188238,     <<< Waiting for AutovacuumScheduleLock
      2.05     0.83  sendto(8,           <<< Asking for stats snapshot

    calls   msec     system call  [ stats collector ]
    ------  -------  -------------------
      3.02     0.05  recvfrom(8,                         <<< Request for snapshot refresh
      1.55   248.64  rename("pg_stat_tmp/db_16385.tmp",  <<< Snapshot refresh

    9.5beta1 + Jeff Janes vac_mov_lock.patch cost per table @ 600 tpm
    -----------------------------------------------------------------
    calls   msec     system call  [ 4 autovacuum workers ]
    ------  -------  -------------------
     24.90   260.16  select(0,           <<< Waiting for stats snapshot
      1.41     0.02  sendto(8,           <<< Asking for stats snapshot

    calls   msec     system call  [ stats collector ]
    ------  -------  -------------------
      1.29     0.02  recvfrom(8,                         <<< Request for snapshot refresh
      0.59    62.40  rename("pg_stat_tmp/db_16385.tmp",  <<< Snapshot refresh

    9.5beta1 + David Gould autovacuum ordering patch cost per table @ 5225 tpm
    --------------------------------------------------------------------------
    calls   msec     system call  [ 4 autovacuum workers ]
    ------  -------  -------------------
      0.63     6.34  select(0,           <<< Waiting for stats snapshot
      0.21     0.01  sendto(8,           <<< Asking for stats snapshot
      0.07     0.00  semop(43712518,     <<< Waiting for AutovacuumLock

    calls   msec     system call  [ stats collector ]
    ------  -------  -------------------
      0.40     0.01  recvfrom(8,                         <<< Request for snapshot refresh
      0.04     6.75  rename("pg_stat_tmp/db_16385.tmp",  <<< Snapshot refresh

-dg

-- 
David Gould   daveg@sonic.net
If simplicity worked, the world would be overrun with insects.
Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From
David Gould
Date:
On Fri, 30 Oct 2015 23:19:52 -0700 David Gould <daveg@sonic.net> wrote:

> The attached patch against REL9_5_STABLE goes a little further. It claims the table under the lock, but also addresses the problem of all the workers racing to redo the same table by enforcing an ordering on all the workers. No worker can claim a table with an oid smaller than the highest oid claimed by any worker. That is, instead of racing to the same table, workers leapfrog over each other.
>
> In theory the recheck of the stats could be eliminated, although this patch does not do that. It does eliminate the special handling of stats snapshots for autovacuum workers, which cuts back on the excess rewriting of the stats file somewhat.
>
> I'll send numbers shortly, but as I recall it is over 100 times better than the original.

I sent performance test data and a setup for reproducing it elsewhere in the thread. I also ran tests on a larger system (128GB mem, many cores, 2x SSD with battery-backed raid).

This is for an idle system with 100,000 new small tables to analyze. I ran all the tests for an hour or 5000 tables processed. "jj" refers to the patch from Jeff Janes, "dg" refers to the attached patch (same as previous).

    Autovacuum actions per minute

    workers   9.5b1     jj      dg
    -------   -----   ----   -----
          1     183    171     285
          4      45    212    1158
          8      23    462    1225

Could someone please take a look at the patch and comment? Thanks.

-dg

-- 
David Gould   510 282 0869   daveg@sonic.net
If simplicity worked, the world would be overrun with insects.
Attachment
Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From
Alvaro Herrera
Date:
David Gould wrote:

> This is for an idle system with 100,000 new small tables to analyze. I ran all the tests for an hour or 5000 tables processed. "jj" refers to the patch from Jeff Janes, "dg" refers to the attached patch (same as previous).
>
>     Autovacuum actions per minute
>
>     workers   9.5b1     jj      dg
>     -------   -----   ----   -----
>           1     183    171     285
>           4      45    212    1158
>           8      23    462    1225

Nice numbers.

> Could someone please take a look at the patch and comment? Thanks.

1. What's with all the FIXMEs?

2. I think you need more of an explanation of what your patch actually does.

3. Do we want to backpatch? Changes in behavior aren't acceptable on existing branches, because they might destabilize autovacuum behavior that's been carefully tuned in existing systems. So if we want something to backpatch, ideally it shouldn't change the ordering in which tables are vacuumed, and instead arrive at the same results faster. (I don't care about this restriction myself, but others do, and strongly so.)

4. In the master branch, behavior changes are acceptable, so we can do something more invasive.

5. Is it easier to use a binary heap rather than the OID list thing you have? (See src/include/lib/binaryheap.h.) I don't think so, but it's worth asking. Note that older branches don't have this, so a backpatchable version should not rely on it.

-- 
Álvaro Herrera   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From
David Gould
Date:
On Tue, 3 Nov 2015 19:24:25 -0300 Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> David Gould wrote:
> > This is for an idle system with 100,000 new small tables to analyze. I ran all the tests for an hour or 5000 tables processed. "jj" refers to the patch from Jeff Janes, "dg" refers to the attached patch (same as previous).
> >
> >     Autovacuum actions per minute
> >
> >     workers   9.5b1     jj      dg
> >     -------   -----   ----   -----
> >           1     183    171     285
> >           4      45    212    1158
> >           8      23    462    1225
>
> Nice numbers.
>
> > Could someone please take a look at the patch and comment? Thanks.
>
> 1. What's with all the FIXMEs?

Those were a trail of breadcrumbs left to explain some of the issues. I have removed them and revised the comments.

> 2. I think you need more of an explanation of what your patch actually does.

I am writing a more complete description of the issues with the current code and what the patch does to address them.

> 3. Do we want to backpatch? Changes in behavior aren't acceptable on existing branches, because they might destabilize autovacuum behavior that's been carefully tuned in existing systems. So if we want something to backpatch, ideally it shouldn't change the ordering in which tables are vacuumed, and instead arrive at the same results faster. (I don't care about this restriction myself, but others do, and strongly so.)

The current order of autovacuum operations is the physical order of the rows in pg_class plus some jitter depending on which worker is able to grab a table first. It seems unlikely anything could depend on this particular order. The heart of this patch is that it establishes a consistent order for all the workers to take tables from so that they don't compete for work. The issue is that autovacuum is ineffective and costly with large numbers of tables. The recent multixact fixes will expose more users to this issue as they update to 9.3.9/9.4.4 and beyond.

> 4. In the master branch, behavior changes are acceptable, so we can do something more invasive.

I will make a version for master.

> 5. Is it easier to use a binary heap rather than the OID list thing you have? (See src/include/lib/binaryheap.h.) I don't think so, but it's worth asking. Note that older branches don't have this, so a backpatchable version should not rely on it.

Thanks for the suggestion, but the binary heap requires knowing the size in advance. The method of using an array and sorting is modeled on src/backend/catalog/pg_inherits.c:find_inheritance_children(), which has a similar need for a sorted list of indefinite size.

Let's move this discussion to hackers to get a bit more coverage, especially if we are considering a version against master. I will post an expanded problem description, the revised patch and test results there after the weekend.

-dg

-- 
David Gould   daveg@sonic.net
If simplicity worked, the world would be overrun with insects.
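For context, the array-and-sort pattern borrowed from find_inheritance_children() looks roughly like this. It is a sketch of the technique, not the patch itself; relid stands for each candidate found during the pg_class scan:

    Oid    *table_oids;
    int     numoids = 0;
    int     maxoids = 32;

    table_oids = (Oid *) palloc(maxoids * sizeof(Oid));

    /* ... inside the pg_class scan, for each relation worth considering ... */
    if (numoids >= maxoids)
    {
        maxoids *= 2;
        table_oids = (Oid *) repalloc(table_oids, maxoids * sizeof(Oid));
    }
    table_oids[numoids++] = relid;

    /* ... after the scan, impose one global ascending order ... */
    if (numoids > 1)
        qsort(table_oids, numoids, sizeof(Oid), oid_cmp);

Sorting by OID gives every worker the same total order to leapfrog along, without needing to know the list size in advance.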
Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From
Josh Berkus
Date:
On 11/06/2015 04:46 PM, David Gould wrote: >> 3. Do we want to backpatch? Changes in behavior aren't acceptable on >> > existing branches, because it might destabilize autovacuum behavior >> > that's been carefully tuned in existing systems. So if we want >> > something to backpatch, ideally it shouldn't change the ordering in >> > which tables are vacuumed, and instead arrive at the same results >> > faster. (I don't care about this restriction myself, but others do and >> > strongly so.) > The current order of autovacuum operations is the physical order of the > rows in pg_class plus some jitter depending on which worker is able to grab > a table first. It seems unlikely anything could depend on this > particular order. I don't know anyone who depends on the ordering of autovacuum, because nobody knows what it is. It's not exactly documented. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From
Alvaro Herrera
Date:
Hi David, did you ever post an updated version of this patch? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From
David Gould
Date:
On Mon, 29 Feb 2016 18:33:50 -0300 Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> Hi David, did you ever post an updated version of this patch?

No. Let me fix that now. I've attached my current revision of the patch based on master. This version is significantly better than the original version and resolves multiple issues:

- autovacuum workers no longer race each other
- autovacuum workers do not re-vacuum each other's tables
- autovacuum workers no longer thrash the stats collector, which saves a lot of IO when the stats file is large

Hopefully the earlier discussion and the comments in the patch are sufficient, but please ask questions if it is not clear.

The result is that on a freshly created 40,000-table database with tiny tables that all need an initial analyze, unpatched postgres saturates an SSD updating the stats and manages to process only about 85 tables per minute. With the attached patch it processes several thousand tables per minute.

The following is a summary of strace output for the autovacuum workers and the stats collector while the 40k-table test is running. The counts and times are the cost per table.

    postgresql 9.5: 85 tables per minute

    Operations per table
    calls   msec     system call  [ 4 autovacuum workers ]
    ------  -------  -------------------
     19.46   196.09  select(0,           <<< Waiting for stats snapshot
      3.26  1040.46  semop(43188238,     <<< Waiting for AutovacuumScheduleLock
      2.05     0.83  sendto(8,           <<< Asking for stats snapshot

    calls   msec     system call  [ stats collector ]
    ------  -------  -------------------
      3.02     0.05  recvfrom(8,                         <<< Request for snapshot refresh
      1.55   248.64  rename("pg_stat_tmp/db_16385.tmp",  <<< Snapshot refresh

    + autovacuum contention patch: 5225 tables per minute

    Operations per table
    calls   msec     system call  [ 4 autovacuum workers ]
    ------  -------  -------------------
      0.63     6.34  select(0,           <<< Waiting for stats snapshot
      0.21     0.01  sendto(8,           <<< Asking for stats snapshot
      0.07     0.00  semop(43712518,     <<< Waiting for AutovacuumLock

    calls   msec     system call  [ stats collector ]
    ------  -------  -------------------
      0.40     0.01  recvfrom(8,                         <<< Request for snapshot refresh
      0.04     6.75  rename("pg_stat_tmp/db_16385.tmp",  <<< Snapshot refresh

Regards,

-dg

-- 
David Gould   daveg@sonic.net
If simplicity worked, the world would be overrun with insects.
Attachment
Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From
Alvaro Herrera
Date:
David Gould wrote: > The following is a summary of strace output for the autovacuum workers > and the stats collector while the 40k table test is running. The counts and > times are the cost per table. > > postgresql 9.5: 85 tables per minute. > + autovacuum contention patch: 5225 tables per minute Great stuff. I intend to look at this as a backpatchable bug fix during this commitfest. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From
Joe Conway
Date:
On 03/02/2016 07:35 AM, Alvaro Herrera wrote:

> David Gould wrote:
> > The following is a summary of strace output for the autovacuum workers and the stats collector while the 40k-table test is running. The counts and times are the cost per table.
> >
> > postgresql 9.5: 85 tables per minute
> > + autovacuum contention patch: 5225 tables per minute
>
> Great stuff. I intend to look at this as a backpatchable bug fix during this commitfest.

+1!

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development
Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From
Alvaro Herrera
Date:
David Gould wrote:

> On Mon, 29 Feb 2016 18:33:50 -0300 Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > Hi David, did you ever post an updated version of this patch?
>
> No. Let me fix that now. I've attached my current revision of the patch based on master. This version is significantly better than the original version and resolves multiple issues:

Thanks for the patch. I spent some time studying it and testing it. I think there's a minor bug in this latest version: the "continue" when the "for" loop exits because of all relids being less than highest_oid_claimed should be a "break" instead. Otherwise the worker will uselessly loop over the whole list of OIDs only to find each time that it has nothing to do.

After seeing it work under pgbench inserting and deleting large numbers of tuples into a large number of tables, I figured that the algorithm you described doesn't actually work in certain corner cases: consider what happens if the table with oid INT_MAX is huge and very bloated. At some point one worker is going to be stuck working on it for a long time, and from that point until the big vacuum is finished, all workers will see that one worker is processing a table with maximal OID and decide they can skip all tables in their work lists.

I tested this by creating one table after all the other test tables were created (so it'd have a higher OID than the others -- but keep an eye on OID wraparound in your test install), setting a nonzero vacuum_cost_limit, then filling it with dead tuples in pgbench via "begin; insert into deadtups select * from generate_series(1, 100000); rollback". I can see that one autovacuum process is stuck with it for over a minute, and while this one runs there is no other autovacuum activity in the database.

We need to cater for this scenario, because it is going to bite somebody. I think it might work to still use the OID-sorted list of tables to process, but instead of pruning based just on the highest OID across all workers, we need to consider only the OID ranges that we've seen processed while the current worker has been running. Seems fiddly :-(

Another thing that makes me uneasy about this approach is that we've had complaints that some tables that have just enough dead tuples generated between each autovacuum run on a DB are vacuumed by all workers all the time, causing vacuum-starvation of other tables that accumulate more dead tuples. It seems to me we should prioritise tables to vacuum depending on how much they actually need vacuum. If we depend on an algorithm that sorts by OID, we will never be able to prioritise anything.

One seemingly trivial way to attack this problem is to forget the sorting idea and just keep track in shared memory of the last (say) 100 tables that have been vacuumed by each worker, along with a timestamp for each. Using that, each worker can determine whether each table on its own list has been vacuumed after it obtained its list, and skip it if so. Since there's no specific sort order required, we could in the future change the ordering in which tables are processed for other reasons.

-- 
Álvaro Herrera   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
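To make the last suggestion concrete, a shared ring buffer of recently finished tables could look roughly like this. The struct and function names are hypothetical, not existing PostgreSQL code, and locking is left to the caller:

    #define AV_RECENT_ENTRIES 100

    typedef struct AVRecentVacuum
    {
        Oid         relid;
        TimestampTz finished_at;
    } AVRecentVacuum;

    typedef struct AVRecentLog
    {
        int             next;                        /* ring-buffer cursor */
        AVRecentVacuum  entries[AV_RECENT_ENTRIES];
    } AVRecentLog;

    /* Called by a worker when it finishes a table (lock held by caller). */
    static void
    av_recent_record(AVRecentLog *log, Oid relid)
    {
        log->entries[log->next].relid = relid;
        log->entries[log->next].finished_at = GetCurrentTimestamp();
        log->next = (log->next + 1) % AV_RECENT_ENTRIES;
    }

    /* True if someone vacuumed relid after this worker built its work list. */
    static bool
    av_recently_done(AVRecentLog *log, Oid relid, TimestampTz list_built_at)
    {
        int     i;

        for (i = 0; i < AV_RECENT_ENTRIES; i++)
        {
            if (log->entries[i].relid == relid &&
                log->entries[i].finished_at > list_built_at)
                return true;
        }
        return false;
    }

Because the check only consults shared memory, it avoids the pgstat refresh entirely while still catching tables that another worker (though not a manual VACUUM) has just finished.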
Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From
David Gould
Date:
On Tue, 15 Mar 2016 17:40:26 -0300 Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> David Gould wrote:
> > On Mon, 29 Feb 2016 18:33:50 -0300 Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > > Hi David, did you ever post an updated version of this patch?
> >
> > No. Let me fix that now. I've attached my current revision of the patch based on master. This version is significantly better than the original version and resolves multiple issues:
>
> Thanks for the patch. I spent some time studying it and testing it. I think there's a minor bug in this latest version: the "continue" when the "for" loop exits because of all relids being less than highest_oid_claimed should be a "break" instead. Otherwise the worker will uselessly loop over the whole list of OIDs only to find each time that it has nothing to do.

Good catch. Thanks!

> After seeing it work under pgbench inserting and deleting large numbers of tuples into a large number of tables, I figured that the algorithm you described doesn't actually work in certain corner cases: consider what happens if the table with oid INT_MAX is huge and very bloated. At some point one worker is going to be stuck working on it for a long time, and from that point until the big vacuum is finished, all workers will see that one worker is processing a table with maximal OID and decide they can skip all tables in their work lists.
>
> I tested this by creating one table after all the other test tables were created (so it'd have a higher OID than the others -- but keep an eye on OID wraparound in your test install), setting a nonzero vacuum_cost_limit, then filling it with dead tuples in pgbench via "begin; insert into deadtups select * from generate_series(1, 100000); rollback". I can see that one autovacuum process is stuck with it for over a minute, and while this one runs there is no other autovacuum activity in the database.
>
> We need to cater for this scenario, because it is going to bite somebody. I think it might work to still use the OID-sorted list of tables to process, but instead of pruning based just on the highest OID across all workers, we need to consider only the OID ranges that we've seen processed while the current worker has been running. Seems fiddly :-(

This is a good observation. I was focused on the bad behavior when there are many small tables, not one large table at the end of the list. The problem with all this vacuum code is that everyone who worked on it seemed to focus on a particular use scenario and was blind to the other. Including me, apparently. However, I think I have a simple fix. I'll think about it a bit and post it once it has gelled a bit.

> Another thing that makes me uneasy about this approach is that we've had complaints that some tables that have just enough dead tuples generated between each autovacuum run on a DB are vacuumed by all workers all the time, causing vacuum-starvation of other tables that accumulate more dead tuples. It seems to me we should prioritise tables to vacuum depending on how much they actually need vacuum. If we depend on an algorithm that sorts by OID, we will never be able to prioritise anything.
>
> One seemingly trivial way to attack this problem is to forget the sorting idea and just keep track in shared memory of the last (say) 100 tables that have been vacuumed by each worker, along with a timestamp for each. Using that, each worker can determine whether each table on its own list has been vacuumed after it obtained its list, and skip it if so. Since there's no specific sort order required, we could in the future change the ordering in which tables are processed for other reasons.

Your discomfort is well founded. The more I learned about autovacuum scheduling the less it made sense. Really, there should be some sort of priority order for vacuuming based on some metric of need, and tables should be processed in that order. I'm willing to work on that, and some other autovacuum issues that I found too. But this patch is intended to be back-patchable. More later.

-dg

-- 
David Gould   510 282 0869   daveg@sonic.net
If simplicity worked, the world would be overrun with insects.
Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From
David Gould
Date:
On Tue, 15 Mar 2016 14:28:16 -0700 David Gould <daveg@sonic.net> wrote:

> On Tue, 15 Mar 2016 17:40:26 -0300 Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > David Gould wrote:
> > > On Mon, 29 Feb 2016 18:33:50 -0300 Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > > > Hi David, did you ever post an updated version of this patch?
> > >
> > > No. Let me fix that now. I've attached my current revision of the patch based on master. This version is significantly better than the original version and resolves multiple issues:
> >
> > Thanks for the patch. I spent some time studying it and testing it. I think there's a minor bug in this latest version: the "continue" when the "for" loop exits because of all relids being less than highest_oid_claimed should be a "break" instead. Otherwise the worker will uselessly loop over the whole list of OIDs only to find each time that it has nothing to do.
>
> Good catch. Thanks!

Actually, looking at this more closely, I don't see the problem. I'm assuming you meant the continue after "if (relid == InvalidOid)":

    /*
     * 2. While there are items that have not been visited...
     */
    for (oid_idx = 0; oid_idx < numoids; )
    {
        ...

        /*
         * 4a. Skip past the highest_oid_claimed to find the next oid to work on.
         */
        relid = InvalidOid;
        while (relid <= highest_oid_claimed && oid_idx < numoids)
            relid = table_oids[oid_idx++];
        if (relid <= highest_oid_claimed)
            relid = InvalidOid;

        /* 4b. Claim the chosen table by storing its oid in shared memory. */
        MyWorkerInfo->wi_tableoid = relid;
        LWLockRelease(AutovacuumLock);

        /* If we reached the end of the list there is nothing to do. */
        if (relid == InvalidOid)
            continue;

        ...
    }

At the continue, oid_idx will have been incremented by:

    while (relid <= highest_oid_claimed && oid_idx < numoids)
        relid = table_oids[oid_idx++];

to be equal to numoids, so the for loop will exit immediately as the loop condition is "oid_idx < numoids". "break" would work as well, but basically there is only one reason to exit the for loop, exhausting the list of oids, so continue seemed preferable. Or am I missing something?

-dg

-- 
David Gould   510 282 0869   daveg@sonic.net
If simplicity worked, the world would be overrun with insects.
Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From
Jim Nasby
Date:
On 3/15/16 4:28 PM, David Gould wrote:

> The more I learned about autovacuum scheduling the less it made sense. Really, there should be some sort of priority order for vacuuming based on some metric of need, and tables should be processed in that order.

+1. What's there now is incredibly braindead.

I actually wonder whether, instead of doing all this the hard way in C, we should just use SPI for each worker to build its list of tables. The big advantage that would provide is the ability for users to customize the scheduling, but I suspect it'd make the code simpler too.

The same is also true for deciding what database needs to be vacuumed next.

-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
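A rough sketch of what the SPI route might look like from inside a worker. The query and threshold are deliberately simplified (they ignore per-table reloptions and analyze thresholds), and remember_candidate() is a hypothetical helper:

    #include "executor/spi.h"

    static void
    build_table_list_via_spi(void)
    {
        int     ret;
        uint64  i;

        SPI_connect();
        ret = SPI_execute("SELECT relid FROM pg_stat_user_tables"
                          " WHERE n_dead_tup > 50 + 0.2 * n_live_tup"
                          " ORDER BY n_dead_tup DESC",
                          true,     /* read-only */
                          0);       /* no row limit */
        if (ret == SPI_OK_SELECT)
        {
            for (i = 0; i < SPI_processed; i++)
            {
                bool    isnull;
                Datum   d = SPI_getbinval(SPI_tuptable->vals[i],
                                          SPI_tuptable->tupdesc, 1, &isnull);

                if (!isnull)
                    remember_candidate(DatumGetObjectId(d));
            }
        }
        SPI_finish();
    }

The appeal is that the selection policy becomes an ordinary query; the robustness and injection concerns raised in the replies still apply.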
Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From
Tom Lane
Date:
Jim Nasby <Jim.Nasby@BlueTreble.com> writes: > I actually wonder if instead of doing all the the hard way in C whether > we should just use SPI for each worker to build it's list of tables. The > big advantage that would provide is the ability for users to customize > the scheduling, but I suspect it'd make the code simpler too. By that you mean "user can write a SQL query that determines autovacuum targets"? -1. That would bring us back to the bad old days where a poorly-thought-out vacuum cron job would miss tables and lead to a database shutdown. Not to mention SQL injection risks. If we need to improve autovac's strategy, let's do that, but not by deeming it the user's problem. regards, tom lane
Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From
David Gould
Date:
On Fri, 18 Mar 2016 09:39:34 -0400 Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Jim Nasby <Jim.Nasby@BlueTreble.com> writes:
> > I actually wonder whether, instead of doing all this the hard way in C, we should just use SPI for each worker to build its list of tables. The big advantage that would provide is the ability for users to customize the scheduling, but I suspect it'd make the code simpler too.
>
> By that you mean "user can write a SQL query that determines autovacuum targets"? -1. That would bring us back to the bad old days where a poorly-thought-out vacuum cron job would miss tables and lead to a database shutdown. Not to mention SQL injection risks.
>
> If we need to improve autovac's strategy, let's do that, but not by deeming it the user's problem.

I have some thoughts for a different approach. In short, the stats collector actually knows what needs vacuuming, because queries that create dead tuples tell it. I'm considering having the stats collector maintain a queue of vacuum work and having autovacuum request work from the stats collector. When I have something more concrete I'll post it on hackers.

-dg

-- 
David Gould   510 282 0869   daveg@sonic.net
If simplicity worked, the world would be overrun with insects.
Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From
Alvaro Herrera
Date:
David Gould wrote: > I have some thoughts for a different approach. In short, the stats collector > actually knows what needs vacuuming because queries that create dead tuples > tell it. I'm considering have the stats collector maintain a queue of > vacuum work and that autovacuum request work from the stats collector. When I > have something more concrete I'll post it on hackers. Hm, ideally we would have something backpatchable, but this new idea doesn't sound so to me. I think that having some variant of the patch you proposed previously would be better for the stable branches. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From
Tomas Vondra
Date:
Hi,

On 03/18/2016 11:13 PM, Alvaro Herrera wrote:

> David Gould wrote:
> > I have some thoughts for a different approach. In short, the stats collector actually knows what needs vacuuming, because queries that create dead tuples tell it. I'm considering having the stats collector maintain a queue of vacuum work and having autovacuum request work from the stats collector. When I have something more concrete I'll post it on hackers.
>
> Hm, ideally we would have something backpatchable, but this new idea doesn't sound so to me. I think that having some variant of the patch you proposed previously would be better for the stable branches.

I believe implementing such a queue in pgstat will be rather difficult, for example because you can tweak the limits in reloptions, and those are only accessible in the database. So pgstat does not know them and thus can't really decide what needs autovacuuming etc.

regards

-- 
Tomas Vondra   http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From
Tom Lane
Date:
David Gould <daveg@sonic.net> writes: > I have some thoughts for a different approach. In short, the stats collector > actually knows what needs vacuuming because queries that create dead tuples > tell it. I'm considering have the stats collector maintain a queue of > vacuum work and that autovacuum request work from the stats collector. When I > have something more concrete I'll post it on hackers. Uh, what? The autovacuum code already looks at the stats maintained by the collector. If what you said means anything, it means "let's move the autovac scheduling logic into the collector", which seems neither useful nor sound from a modularity standpoint. regards, tom lane
Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From
David Gould
Date:
On Fri, 18 Mar 2016 19:13:17 -0300 Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > David Gould wrote: > > > I have some thoughts for a different approach. In short, the stats collector > > actually knows what needs vacuuming because queries that create dead tuples > > tell it. I'm considering have the stats collector maintain a queue of > > vacuum work and that autovacuum request work from the stats collector. When I > > have something more concrete I'll post it on hackers. > > Hm, ideally we would have something backpatchable, but this new idea > doesn't sound so to me. I think that having some variant of the patch > you proposed previously would be better for the stable branches. Sorry for the alarm. My plan is to update the current patch to address the problem you described. This is intended to be backpatchable. The other thoughts would be for new development for a future release. -dg -- David Gould 510 282 0869 daveg@sonic.net If simplicity worked, the world would be overrun with insects.
Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From
David Gould
Date:
On Fri, 18 Mar 2016 18:23:51 -0400 Tom Lane <tgl@sss.pgh.pa.us> wrote:

> David Gould <daveg@sonic.net> writes:
> > I have some thoughts for a different approach. In short, the stats collector actually knows what needs vacuuming, because queries that create dead tuples tell it. I'm considering having the stats collector maintain a queue of vacuum work and having autovacuum request work from the stats collector. When I have something more concrete I'll post it on hackers.
>
> Uh, what? The autovacuum code already looks at the stats maintained by the collector. If what you said means anything, it means "let's move the autovac scheduling logic into the collector", which seems neither useful nor sound from a modularity standpoint.

Well, there is that. That's why I'm still considering and not yet posting a concrete proposal on hackers. Really, there is no convenient location for this decision making, as no single process has all the information needed to optimize autovacuum scheduling across databases. It's a bit of a puzzle.

-dg

-- 
David Gould   510 282 0869   daveg@sonic.net
If simplicity worked, the world would be overrun with insects.