
From Tomas Vondra
Subject autovacuum stress-testing our system
Msg-id 1718942738eb65c8407fcd864883f4c8@fuzzy.cz
Responses Re: autovacuum stress-testing our system  (Alvaro Herrera <alvherre@2ndquadrant.com>)
Re: autovacuum stress-testing our system  (Jeff Janes <jeff.janes@gmail.com>)
Re: autovacuum stress-testing our system  (Euler Taveira <euler@timbira.com>)
PATCH: Split stats file per database WAS: autovacuum stress-testing our system  (Tomas Vondra <tv@fuzzy.cz>)
List pgsql-hackers
Hi,

I've been struggling with autovacuum generating a lot of I/O and CPU on some of
our systems - after a night spent analyzing this behavior, I believe the current
autovacuum accidentally behaves a bit like a stress-test in some corner cases
(but I may be seriously wrong, after all it was a long night).

First - our system really is not a "common" one - we have ~1000 databases of
various sizes, each containing up to several thousand tables (a few user-defined
tables, the rest serving as caches for a reporting application - yes, it's a bit
of a weird design, but that's life). This all leads to a pgstat.stat significantly
larger than 60 MB.

Now, the two main pieces of information from pgstat.c are the timer definitions

---------------------------------- pgstat.c : 80 ----------------------------------

#define PGSTAT_STAT_INTERVAL    500     /* Minimum time between stats file
                                         * updates; in milliseconds. */

#define PGSTAT_RETRY_DELAY      10      /* How long to wait between checks for
                                         * a new file; in milliseconds. */

#define PGSTAT_MAX_WAIT_TIME    10000   /* Maximum time to wait for a stats
                                         * file update; in milliseconds. */

#define PGSTAT_INQ_INTERVAL     640     /* How often to ping the collector for
                                         * a new file; in milliseconds. */

#define PGSTAT_RESTART_INTERVAL 60      /* How often to attempt to restart a
                                         * failed statistics collector; in
                                         * seconds. */

#define PGSTAT_POLL_LOOP_COUNT  (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
#define PGSTAT_INQ_LOOP_COUNT   (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)

-----------------------------------------------------------------------------------
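
(For reference, plugging the defaults into the two derived macros gives:

    PGSTAT_POLL_LOOP_COUNT = 10000 / 10 = 1000   /* up to 1000 polls, i.e. 10 seconds */
    PGSTAT_INQ_LOOP_COUNT  =   640 / 10 =   64   /* re-send the inquiry every ~640 ms  */

so a single read attempt may poll the stats file up to a thousand times.)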

and then this loop (the current HEAD does this a bit differently, but the 9.2 code
is a bit more readable and suffers from the same issue):

---------------------------------- pgstat.c : 3560 --------------------------------

/*
 * Loop until fresh enough stats file is available or we ran out of time.
 * The stats inquiry message is sent repeatedly in case collector drops
 * it; but not every single time, as that just swamps the collector.
 */
for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
{
    TimestampTz file_ts = 0;

    CHECK_FOR_INTERRUPTS();

    if (pgstat_read_statsfile_timestamp(false, &file_ts) &&
        file_ts >= min_ts)
        break;

    /* Not there or too old, so kick the collector and wait a bit */
    if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
        pgstat_send_inquiry(min_ts);

    pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
}

if (count >= PGSTAT_POLL_LOOP_COUNT)
    elog(WARNING, "pgstat wait timeout");

/* Autovacuum launcher wants stats about all databases */
if (IsAutoVacuumLauncherProcess())
    pgStatDBHash = pgstat_read_statsfile(InvalidOid, false);
else
    pgStatDBHash = pgstat_read_statsfile(MyDatabaseId, false);

-----------------------------------------------------------------------------------

What this code does is check the statfile, and if it's not stale (the timestamp
of the write start is not older than PGSTAT_RETRY_DELAY milliseconds), the loop
is terminated and the file is read.
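
For illustration, the staleness cutoff (min_ts) used by that loop amounts to
something like this - paraphrased from memory, not an exact quote of pgstat.c:

    /* Paraphrased illustration only - the exact expression in pgstat.c may
     * differ. The caller computes the oldest acceptable file timestamp before
     * entering the loop above; for the autovacuum case the window is
     * PGSTAT_RETRY_DELAY, as described. */
    TimestampTz cur_ts = GetCurrentTimestamp();
    TimestampTz min_ts = TimestampTzPlusMilliseconds(cur_ts, -PGSTAT_RETRY_DELAY);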

Now, let's suppose the write takes >10 ms, which is the PGSTAT_RETRY_DELAY value.
With our current pgstat.stat filesize / number of relations, this is quite common.
Actually, the typical write time in our case is ~100 ms, even if we move the file
into tmpfs. That means that almost all the calls to backend_read_statsfile (which
happen in all pgstat_fetch_stat_*entry calls) result in a continuous stream of
inquiries from the autovacuum workers and constant writing/reading of the file.

We're not getting 'pgstat wait timeout' though, because the file finally gets
written before PGSTAT_MAX_WAIT_TIME elapses.
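
(Roughly: with a ~100 ms write the poll loop spins for only about 100 / 10 = 10
iterations before a fresh file appears, far below the 1000-iteration / 10 s cap,
so the warning never fires - the work simply gets redone over and over.)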

By moving the file to a tmpfs we've minimized the I/O impact, but now the
collector and autovacuum launcher consume ~75% of CPU (i.e. about one core) and
do nothing except burn power, because the database is almost read-only. Not a
good thing in the "green computing" era, I guess.

First, I'm interested in feedback - did I get all the details right, or am I
missing something important?

Next, I'm thinking about ways to solve this:

1) turning off autovacuum and doing regular VACUUM ANALYZE from cron - certainly
   an option, but it's rather a workaround than a solution and I'm not very fond
   of it. Moreover, it fixes only one side of the problem - triggering the
   statfile writes over and over. The file will be written anyway, although not
   that frequently.

2) tweaking the timer values, especially increasing PGSTAT_RETRY_DELAY and so on,
   to consider several seconds to be fresh enough - it would be nice to have
   these as GUC variables, although we can do another private patch on our own.
   But more knobs are not always better.

3) logic detecting the proper PGSTAT_RETRY_DELAY value - based mostly on the time
   it takes to write the file (e.g. 10x the write time or something).

4) keeping some sort of "dirty flag" in stat entries - and then writing only info
   about objects that were modified enough to be eligible for vacuum/analyze
   (e.g. an increasing number of index scans can't trigger autovacuum while
   inserting rows can). Also, I'm not worried about getting a slightly stale
   number of index scans, so 'clean' records might be written less frequently
   than 'dirty' ones.

5) splitting the single stat file into multiple pieces - e.g. per database,
   written separately, so that the autovacuum workers don't need to read all the
   data even for databases that don't need to be vacuumed. This might be combined
   with (4). (A rough sketch of what I mean follows right after this list.)
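
To make (5) a bit more concrete, here is the kind of helper I have in mind - the
function name, directory and file layout are purely illustrative, nothing like
this exists in pgstat.c today:

    /*
     * Purely illustrative sketch for option (5): one stats file per database,
     * e.g. "pg_stat_tmp/db_<oid>.stat", plus a small file for global/shared
     * stats, so a backend or worker only reads the database it cares about.
     */
    static void
    get_dbstat_filename(Oid databaseid, char *filename, int len)
    {
        snprintf(filename, len, "pg_stat_tmp/db_%u.stat", databaseid);
    }

    /* ... and the pgstat_read_statsfile(MyDatabaseId, false) call shown above
     * would then only have to parse that one per-database file instead of the
     * whole 60+ MB pgstat.stat. */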

Ideas? Objections? Preferred options?

I kinda like (4+5), although that'd be a pretty big patch and I'm not entirely
sure it can be done without breaking other things.

regards
Tomas


