Re: pg_autovacuum next steps - Mailing list pgsql-hackers

From Matthew T. O'Connor
Subject Re: pg_autovacuum next steps
Date
Msg-id 1079926579.13076.30.camel@zeudora.zeut.net
Whole thread Raw
In response to Re: pg_autovacuum next steps  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: pg_autovacuum next steps
Re: pg_autovacuum next steps
Re: pg_autovacuum next steps
List pgsql-hackers
On Sun, 2004-03-21 at 18:12, Tom Lane wrote:
> "Matthew T. O'Connor" <matthew@zeut.net> writes:
> > [ rtf document ]
>
> Please repost in some less proprietary format.  Plain text is generally
> considered the thing to use on this list.

I don't think RTF is proprietary but I should have just posted inline
anyway so here is a copy:

pg_autovacuum Version 2
Design Document:


Exec Summary:
pg_autovacuum was initially released as a contrib module in PostgreSQL
v7.4.   The version in 7.4 is by design very simple.  No configuration
is required, and very little configuration is possible.  Despite these
limitations it was voted the most popular new feature of PostgreSQL v7.4
according to the survey held on postgresql.org
(http://www.postgresql.org/survey.php?View=1&SurveyID=23).

Despite it's popularity there is much room for improvement.  This
document sets out to define the most important improvements that would
help pg_autovacuum to become a truly powerful asset to the suite of
tools that come with PostgreSQL.


Current Problems & Limitations:

Based on user feedback from people using pg_autovacuum in the field, and
my own observations, there are a number of problems and limitation with
pg_autovacuum.  They are:

* Inability to customize thresholds on a per table basis
* Inability to set default thresholds on a per database basis
* Inability to exclude specific databases / tables from pg_autovacuum
monitoring
* Inability to schedule vacuums during off-peak times
* Lack of integration related to startup and shutdown
* Ignorance of VACUUM and ANALYZE operations performed outside
pg_autovacuum (requires backend integration? or can listen / notify can
be used?)
* Lack of logging options / syslog integration / log rotation options
* Create table fails because template1 is busy

I'm not sure how to address all of these concerns, or that they all
should be addressed right now.  One of my big questions is backend
integration.  I am leaning towards leaving pg_autovacuum as a client
application in contrib for one more release.  During this time, I can
continue to tweak and improve pg_autovacuum so that we will have a very
good idea what the final product should be before we make it a standard
backend process.


For PostgreSQL 7.5,  I plan to implement these new features:
1.Per database defaults and per table thresholds (including total
exclusion)2.Persistent data3.Single-Pass Mode (external scheduling from cron etc...)4.Off peak scheduling

1. Per Database defaults and Per table Thresholds:

There are differing opinions as to the best way to providing these this
feature.  The primary debate is where to save the configuration data.  I
see three options:
1.Store config data inside a special pg_autovacuum table inside
existing databases that wants custom settings.
2.Use a config file.  This would require some additional coding to add
the required parsing, but is possible.
3.Create a pg_autovacuum database inside any cluster that wants to
customize their settings.

Since many people do not like tools that clutter their databases by
adding tables, I think option 1 (adding a pg_autovacuum table to
existing databases) is right out.  Using a config file would be Ok, but
would require additional parsing code.  My preference is option 3.
Since pg_autovacuum will (hopefully) eventually become an integrated
part of the backend, it will eventually be able to add required data to
the system catalogs.  Given these two premises, as long as pg_autovacuum
remains a contrib module it could use it's own database to mimic having
system tables.  If this database exists, it will be used, if it does not
exist, then pg_autovacuum will work just as it did in the 7.4 release
with very limited options available to it.  The user will be able to
specify a non-default database.

Table Structure for database specific defaults and table specific
thresholds:

databases_defaults: (will reference the pg_class system table)id                     serial primary keyexclude_database
          booleandefault_vacuum_scaling_factor     floatdefault_vacuum_base_value
intdefault_analyze_scaling_factor   floatdefault_analyze_base_value      intdboid                    oid references
pg_database.oid
table_thresholdsid             serial primary keyexclude_table        boolean    (exclude this
table)vacuum_scaling_factor    float    (equivalent to  -v)vacuum_base_value      int    (equivalent to
-V)vacuum_threshold   float    (if > 0, use this threshold)analyze_scaling_factor    float    (equivalent to
-a)analyze_base_value     int    (equivalent to -A)analyze_threshold    float    (if > 0 use this threshold)relid
    oid references pg_classs.relid 


2.Persistent pg_autovacuum Data:

Right now pg_autovacuum has no memory of what was going on the last time
it was run.  So if significant changes have happened while pg_autovacuum
is not running, they will not be counted in the analysis of when to
perform a vacuum or analyze operation which can result in under
vacuuming.  So, pg_autovacuum should occasionally write down it's
numbers to the database.  The data  will be stored in an additional
table called table_data

table_dataid             serial primary keyCountAtLastAnalyze     long CountAtLastVacuum     longtable_thresholds_id
intreferences table_thresholds 


3.Single-Pass Mode (External Scheduling):

I have received requests to be able to run pg_autovacuum only on request
(not as a daemon) making only one pass over all the tables (not looping
indefinately).  The advantage being that it will operate more like the
current vacuum command except that it will only vacuum tables that need
to be vacuumed.  This feature could be useful as long as pg_autovacuum
exists outside the backend.  If pg_autovacuum gets integrated into the
backend and gets automatically started as a daemon during startup, then
this option will no longer make sense.

Once we have persistent data (Step 2) then we can easily operate in
Single-Pass Mode.


4.Off-Peak Scheduling:

A fundamental advantage of our vacuum system is that the work required
to reclaim table space is taken out of the critical path and can be
moved to and off-peak time when cycles are less precious.  One of the
drawbacks of the current pg_autovacuum is that it doesn't have any way
to factor this in.

In it's simplest form (which I will implement first) I would add the
ability to add a second set of thresholds that will be active only
during an “off-peak” time that can be specified in the pg_autovacuum
database, perhaps in a general_settings table.





A few notes about things I'm not planning on working on at the moment.

Better logging options:

An additional logging could would be added to the pg_autovacuum database
and will log all activity (vacuums and analyzes) along with their
corresponding duration.

Syslog support.  I'm not sure this is really needed, but a simple patch
was submitted by one user and perhaps that can be reviewed / improved
and applied.






pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: execute command tag including affected rows count
Next
From: Gavin Sherry
Date:
Subject: Re: pg_autovacuum next steps