Re: autovacuum stress-testing our system - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: autovacuum stress-testing our system
Msg-id: 50A965EF.50708@fuzzy.cz
In response to: Re: autovacuum stress-testing our system (Jeff Janes <jeff.janes@gmail.com>)
Responses: Re: autovacuum stress-testing our system
List: pgsql-hackers
Hi!

On 26.9.2012 19:18, Jeff Janes wrote:
> On Wed, Sep 26, 2012 at 9:29 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>>> Excerpts from Euler Taveira's message of mié sep 26 11:53:27 -0300 2012:
>>>> On 26-09-2012 09:43, Tomas Vondra wrote:
>>>>> 5) splitting the single stat file into multiple pieces - e.g. per database,
>>>>> written separately, so that the autovacuum workers don't need to read all
>>>>> the data even for databases that don't need to be vacuumed. This might be
>>>>> combined with (4).
>>
>>>> IMHO that's the definitive solution. It would be one file per database plus a
>>>> global one. That way, the check would only read the global.stat and process
>>>> those database that were modified. Also, an in-memory map could store that
>>>> information to speed up the checks.
>>
>>> +1
>>
>> That would help for the case of hundreds of databases, but how much
>> does it help for lots of tables in a single database?
>
> It doesn't help that case, but that case doesn't need much help. If
> you have N statistics-kept objects in total spread over M databases,
> of which T objects need vacuuming per naptime, the stats file traffic
> is proportional to N*(M+T). If T is low, then there is generally is
> no problem if M is also low. Or at least, the problem is much smaller
> than when M is high for a fixed value of N.

I've done some initial hacking on splitting the stat file into multiple
smaller pieces over the weekend, and it seems promising (at least with
respect to the issues we're having). See the patch attached, but be aware
that this is a very early WIP (or rather a proof of concept), so it has
many rough edges (read "sloppy coding"). I haven't even added it to the
commitfest yet ...

The two main changes are these:

(1) The stats file is split into a common "db" file, containing all the
DB entries, and per-database files with the tables/functions. The common
file is still called "pgstat.stat"; the per-db files have the database
OID appended, so for example "pgstat.stat.12345" etc.

This was a trivial hack of the pgstat_read_statsfile/pgstat_write_statsfile
functions, introducing two new functions:

    pgstat_read_db_statsfile
    pgstat_write_db_statsfile

that do the trick of reading/writing the stat file for one database.

(2) pgstat_read_statsfile has an additional parameter "onlydbs", saying
that only the list of db entries is needed, not the table/func stats.
This is used for the autovacuum launcher, which does not need the
table/func stats (if I'm reading the code in autovacuum.c correctly - it
seems to be working as expected).

So what are the benefits?

(a) When the launcher asks for info about databases, something like this
is called in the end:

    pgstat_read_db_statsfile(InvalidOid, false, true)

which means all databases (InvalidOid) and only db info (true). So it
reads only the one common file with db entries, not the table/func stats.

(b) When a worker asks for stats for a given DB, something like this is
called in the end:

    pgstat_read_db_statsfile(MyDatabaseId, false, false)

which reads only the common stats file (with db entries) and only the one
file for that database (see the sketch after this list).

With the current implementation (a single pgstat.stat file), all the data
had to be read (and silently skipped) in both cases. That's a lot of CPU
time - we're seeing ~60% of CPU spent just on this (writing/reading the
huge stats file). So with a lot of databases/objects, this "pgstat.stat
split" saves us a lot of CPU ...

(c) This should lower the space requirements too - with a single file,
you actually need at least 2x the disk space (or RAM, if you're using
tmpfs as we are), because you need to keep two versions of the file at
the same time (pgstat.stat and pgstat.tmp). Thanks to this split, you
only need additional space for a copy of the largest piece (plus some
reasonable safety reserve).
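To make the read path a bit more concrete, here's a trimmed-down
standalone sketch of the idea. This is NOT the patch code - the real
pgstat_read_db_statsfile() takes one more flag (omitted here) and does
proper error handling and hashing; the entry layout and the helper below
are made up just for illustration:

/*
 * Standalone sketch of the split read path (illustration only).
 */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

typedef uint32_t Oid;           /* stand-in for the real Oid typedef */
#define InvalidOid ((Oid) 0)

typedef struct DBEntry
{
    Oid     dboid;
    int64_t n_xact_commit;      /* dummy per-database counter */
} DBEntry;

/* Read one per-database file, e.g. "pgstat.stat.12345" (tables/functions). */
static void
read_one_db_file(Oid dboid)
{
    char    path[64];
    FILE   *f;

    snprintf(path, sizeof(path), "pgstat.stat.%u", (unsigned) dboid);
    if ((f = fopen(path, "rb")) == NULL)
        return;                 /* no stats collected for this DB yet */

    /* ... deserialize table/function entries for this one database ... */

    fclose(f);
}

/*
 * Read the common "pgstat.stat" (db entries only); unless onlydbs is set,
 * also read the per-database file(s) with the table/function stats.
 */
static void
pgstat_read_db_statsfile(Oid dboid, bool onlydbs)
{
    FILE   *f;
    DBEntry entry;

    if ((f = fopen("pgstat.stat", "rb")) == NULL)
        return;

    while (fread(&entry, sizeof(entry), 1, f) == 1)
    {
        /* ... stash the DB entry in the local hash ... */

        if (onlydbs)
            continue;           /* launcher: the db entries are enough */

        if (dboid == InvalidOid || dboid == entry.dboid)
            read_one_db_file(entry.dboid);  /* worker: just this DB's file */
    }
    fclose(f);
}

int
main(void)
{
    pgstat_read_db_statsfile(InvalidOid, true); /* (a) autovacuum launcher */
    pgstat_read_db_statsfile(12345, false);     /* (b) worker for DB 12345 */
    return 0;
}

The point is simply that case (a) touches one small file and case (b)
touches exactly two, no matter how many other databases there are.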
Well, it's a very early patch, so there are rough edges too:

(a) It does not solve the "many-schema" scenario at all - that'll need
a completely new approach, I guess :-(

(b) It does not solve the writing part at all - the current code uses a
single timestamp (last_statwrite) to decide if a new file needs to be
written. That clearly is not enough for multiple files - there should be
one timestamp per database/file. I'm still thinking about how to solve
this and how to integrate it with pgstat_send_inquiry etc.

One way might be to add the timestamp(s) into PgStat_StatDBEntry, another
to keep an array of inquiries, one for each database. Yet another option
I'm thinking about is a fixed-length array of timestamps (e.g. 256),
indexed by mod(dboid,256). That would mean stats for all databases with
the same mod(oid,256) get written at the same time. It seems like
over-engineering, though (there's a rough sketch of this after the list).

(c) I'm a bit worried about the number of files - right now there's one
per database, and I'm thinking about splitting them by type (one for
tables, one for functions), which might make things even faster for apps
with a lot of stored procedures etc. But is the large number of files
actually a problem? After all, we already use one file per relation fork
in the "base" directory, so this seems like a minor issue. And if it
really is a problem, it might be solved by the same mod(oid,256) trick,
combining multiple databases into one file (which would work neatly with
the fixed-length array of timestamps).
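To illustrate the fixed-length timestamp array from (b), here's a rough
standalone sketch - again purely hypothetical, none of this is in the
attached patch, and all the names are made up:

/*
 * Sketch of the mod(dboid,256) timestamp array (illustration only).
 */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

typedef uint32_t Oid;           /* stand-in for the real Oid typedef */
#define PGSTAT_DB_SLOTS 256

/* one timestamp per bucket; DBs with the same mod(oid,256) share a slot */
static time_t last_db_statwrite[PGSTAT_DB_SLOTS];

/* does the stats file for this database need to be rewritten? */
static int
db_statsfile_is_stale(Oid dboid, time_t cutoff)
{
    return last_db_statwrite[dboid % PGSTAT_DB_SLOTS] < cutoff;
}

/* remember that we just wrote the file(s) for this bucket */
static void
db_statsfile_written(Oid dboid)
{
    last_db_statwrite[dboid % PGSTAT_DB_SLOTS] = time(NULL);
}

int
main(void)
{
    Oid     db = 12345;
    time_t  cutoff = time(NULL) - 1;    /* "must be fresher than 1s ago" */

    if (db_statsfile_is_stale(db, cutoff))
    {
        /* ... rewrite the per-database stats file(s) in this bucket ... */
        db_statsfile_written(db);
        printf("rewrote stats for bucket %u\n",
               (unsigned) (db % PGSTAT_DB_SLOTS));
    }
    return 0;
}

The obvious downside is the false sharing - all databases in the same
bucket get their files rewritten together - but combined with the
mod(oid,256) file-combining idea from (c), bucket == file, so that might
actually turn out to be a feature.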
kind regards

Tomas