Re: autovacuum stress-testing our system - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: autovacuum stress-testing our system
Date
Msg-id 50A965EF.50708@fuzzy.cz
Whole thread Raw
In response to Re: autovacuum stress-testing our system  (Jeff Janes <jeff.janes@gmail.com>)
Responses Re: autovacuum stress-testing our system  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Hi!

On 26.9.2012 19:18, Jeff Janes wrote:
> On Wed, Sep 26, 2012 at 9:29 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>>> Excerpts from Euler Taveira's message of mié sep 26 11:53:27 -0300 2012:
>>>> On 26-09-2012 09:43, Tomas Vondra wrote:
>>>>> 5) splitting the single stat file into multiple pieces - e.g. per database,
>>>>> written separately, so that the autovacuum workers don't need to read all
>>>>> the data even for databases that don't need to be vacuumed. This might be
>>>>> combined with (4).
>>
>>>> IMHO that's the definitive solution. It would be one file per database plus a
>>>> global one. That way, the check would only read the global.stat and process
>>>> those database that were modified. Also, an in-memory map could store that
>>>> information to speed up the checks.
>>
>>> +1
>>
>> That would help for the case of hundreds of databases, but how much
>> does it help for lots of tables in a single database?
>
> It doesn't help that case, but that case doesn't need much help.  If
> you have N statistics-kept objects in total spread over M databases,
> of which T objects need vacuuming per naptime, the stats file traffic
> is proportional to N*(M+T).  If T is low, then there is generally is
> no problem if M is also low.  Or at least, the problem is much smaller
> than when M is high for a fixed value of N.

I've done some initial hacking on splitting the stat file into multiple
smaller pieces over the weekend, and it seems promising (at least with
respect to the issues we're having).

See the patch attached, but be aware that this is a very early WIP (or
rather a proof of concept), so it has many rough edges (read "sloppy
coding"). I haven't even added it to the commitfest yet ...

The two main changes are these:

(1) The stats file is split into a common "db" file, containing all the
    DB Entries, and per-database files with tables/functions. The common
    file is still called "pgstat.stat", the per-db files have the
    database OID appended, so for example "pgstat.stat.12345" etc.

    This was a trivial hack pgstat_read_statsfile/pgstat_write_statsfile
    functions, introducing two new functions:

       pgstat_read_db_statsfile
       pgstat_write_db_statsfile

    that do the trick of reading/writing stat file for one database.

(2) The pgstat_read_statsfile has an additional parameter "onlydbs" that
    says that you don't need table/func stats - just the list of db
    entries. This is used for autovacuum launcher, which does not need
    to read the table/stats (if I'm reading the code in autovacuum.c
    correctly - it seems to be working as expected).

So what are the benefits?

(a) When a launcher asks for info about databases, something like this
    is called in the end:

       pgstat_read_db_statsfile(InvalidOid, false, true)

    which means all databases (InvalidOid) and only db info (true). So
    it reads only the one common file with db entries, not the
    table/func stats.

(b) When a worker asks for stats for a given DB, something like this is
    called in the end:

       pgstat_read_db_statsfile(MyDatabaseId, false, false)

    which reads only the common stats file (with db entries) and only
    one file for the one database.

    The current implementation (with the single pgstat.stat file), all
    the data had to be read (and skipped silently) in both cases.
    That's a lot of CPU time, and we're seeing ~60% of CPU spent on
    doing just this (writing/reading huge statsfile).

    So with a lot of databases/objects, this "pgstat.stat split" saves
    us a lot of CPU ...

(c) This should lower the space requirements too - with a single file,
    you actually need at least 2x the disk space (or RAM, if you're
    using tmpfs as we are), because you need to keep two versions of
    the file at the same time (pgstat.stat and pgstat.tmp).

    Thanks to this split you only need additional space for a copy of
    the largest piece (with some reasonable safety reserve).


Well, it's very early patch, so there are rough edges too

(a) It does not solve the "many-schema" scenario at all - that'll need
    a completely new approach I guess :-(

(b) It does not solve the writing part at all - the current code uses a
    single timestamp (last_statwrite) to decide if a new file needs to
    be written.

    That clearly is not enough for multiple files - there should be one
    timestamp for each database/file. I'm thinking about how to solve
    this and how to integrate it with pgstat_send_inquiry etc.

    One way might be adding the timestamp(s) into PgStat_StatDBEntry
    and the other one is using an array of inquiries for each database.

    And yet another one I'm thinking about is using a fixed-length
    array of timestamps (e.g. 256), indexed by mod(dboid,256). That
    would mean stats for all databases with the same mod(oid,256) would
    be written at the same time. Seems like an over-engineering though.

(c) I'm a bit worried about the number of files - right now there's one
    for each database and I'm thinking about splitting them by type
    (one for tables, one for functions) which might make it even faster
    for some apps with a lot of stored procedures etc.

    But is the large number of files actually a problem? After all,
    we're using one file per relation fork in the "base" directory, so
    this seems like a minor issue.

    And if really an issue, this might be solved by the mod(oid,256) to
    combine multiple files into one (which would work neatly with the
    fixed-length array of timestamps).

kind regards
Tomas


Attachment

pgsql-hackers by date:

Previous
From: Alexander Korotkov
Date:
Subject: WIP: store additional info in GIN index
Next
From: Marko Tiikkaja
Date:
Subject: Re: pg_dump --split patch