Thread: High inserts, bulk deletes - autovacuum vs scheduled vacuum

High inserts, bulk deletes - autovacuum vs scheduled vacuum

From: "Jeremy Haile"
I am developing an application that has very predictable database
operations:
  -inserts several thousand rows into 3 tables every 5 minutes. (the tables
  contain around 10 million rows each)
  -truncates and rebuilds aggregate tables of this data every 5 minutes.
  (several thousand rows each)
  -regular reads of aggregate table and sometimes large tables by user
  interaction
  -every night, hundreds of thousands of rows are deleted from these 3
  tables (old data)
  -20-30 other tables get inserted/updated slowly throughout the day

In order to optimize performance of the inserts, I disabled
autovacuum/row-level stats and instead run "vacuum analyze" on the whole
DB every hour.  However this operation takes around 20 minutes of each
hour.  This means that the database is involved in vacuum/analyzing
tables 33% of the time.
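
(The hourly job boils down to a plain database-wide VACUUM ANALYZE, i.e.
roughly the following, run from a scheduled task:)

  -- equivalent to "vacuumdb --analyze" against the same database
  VACUUM ANALYZE;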

I'd like any performance advice, but my main concern is the amount of
time vacuum/analyze runs and its possible impact on the overall DB
performance.  Thanks!


I am running 8.2 (will be 8.2.1 soon).  The box is Windows with 2GB RAM
connected to a SAN over fiber.  The data and pg_xlog are on separate
partitions.

Modified configuration:
effective_cache_size = 1000MB
random_page_cost = 3
default_statistics_target = 50
maintenance_work_mem = 256MB
shared_buffers = 400MB
temp_buffers = 10MB
work_mem = 10MB
max_fsm_pages = 1500000
checkpoint_segments = 30
stats_row_level = off
stats_start_collector = off

Re: High inserts, bulk deletes - autovacuum vs scheduled vacuum

From: Florian Weimer
* Jeremy Haile:

> I'd like any performance advice, but my main concern is the amount of
> time vacuum/analyze runs and its possible impact on the overall DB
> performance.  Thanks!

You could partition your data tables by date and discard old data
simply by dropping the tables.  This is far more effective than
vacuuming, but obviously, this approach cannot be used in all cases
(e.g. if you need more dynamic expiry rules).
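
A minimal sketch of what that can look like on 8.2, using inheritance,
CHECK constraints and constraint exclusion (table and column names are
made up for illustration):

  -- parent table that the application and reports query
  CREATE TABLE events (
      id         bigint,
      created_at timestamp NOT NULL,
      payload    text
  );

  -- one child per day; the CHECK constraint is what lets the planner
  -- skip partitions that cannot match a date-restricted query
  CREATE TABLE events_2007_01_09 (
      CHECK (created_at >= DATE '2007-01-09'
         AND created_at <  DATE '2007-01-10')
  ) INHERITS (events);

  -- in postgresql.conf (or per session): constraint_exclusion = on

  -- nightly expiry becomes a metadata operation instead of DELETE + VACUUM
  DROP TABLE events_2007_01_03;

Inserts have to be routed to the right child, either directly by the
application or via a rule or trigger on the parent.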

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99

Re: High inserts, bulk deletes - autovacuum vs scheduled

From: "Jeremy Haile"
Good advice on the partitioning idea.  I may have to restructure some of
my queries, since some of them query across the whole range - but it may
be a much more performant solution.  How is the performance when
querying across a set of partitioned tables vs. querying on a single
table with all rows?  This may be a long term idea I could tackle, but
is probably not feasible for my current time-frame.

Does my current approach of disabling autovacuum and manually vacuuming
once-an-hour sound like a good idea, or would I likely have better
results by auto-vacuuming and turning row-level stats back on?


On Tue, 09 Jan 2007 19:02:25 +0100, "Florian Weimer" <fweimer@bfk.de>
said:
> * Jeremy Haile:
>
> > I'd like any performance advice, but my main concern is the amount of
> > time vacuum/analyze runs and its possible impact on the overall DB
> > performance.  Thanks!
>
> You could partition your data tables by date and discard old data
> simply by dropping the tables.  This is far more effective than
> vacuuming, but obviously, this approach cannot be used in all cases
> (e.g. if you need more dynamic expiry rules).
>
> --
> Florian Weimer                <fweimer@bfk.de>
> BFK edv-consulting GmbH       http://www.bfk.de/
> Kriegsstraße 100              tel: +49-721-96201-1
> D-76133 Karlsruhe             fax: +49-721-96201-99

Re: High inserts, bulk deletes - autovacuum vs scheduled vacuum

From: Florian Weimer
* Jeremy Haile:

> Good advice on the partitioning idea.  I may have to restructure some of
> my queries, since some of them query across the whole range - but it may
> be a much more performant solution.  How is the performance when
> querying across a set of partitioned tables vs. querying on a single
> table with all rows?

Locality of access decreases, of course, and depending on your data
size, you hit something like 2 to 4 additional disk seeks per
partition for index-based accesses.  Sequential scans are not
impacted.
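
(With constraint_exclusion = on and the partition key in the WHERE
clause, the planner skips children whose CHECK constraints rule them
out; easy to verify with EXPLAIN, e.g. against the hypothetical "events"
layout sketched earlier:)

  EXPLAIN SELECT count(*)
    FROM events
   WHERE created_at >= DATE '2007-01-09';
  -- the plan should list only the parent and the matching child table(s)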

> Does my current approach of disabling autovacuum and manually vacuuming
> once-an-hour sound like a good idea, or would I likely have better
> results by auto-vacuuming and turning row-level stats back on?

Sorry, I haven't got much experience with autovacuum, since most of
our other databases are INSERT-only (or get VACUUMed automatically
after major updates).

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99

Re: High inserts, bulk deletes - autovacuum vs scheduled vacuum

From: "Jim C. Nasby"
On Tue, Jan 09, 2007 at 12:26:41PM -0500, Jeremy Haile wrote:
> I am developing an application that has very predictable database
> operations:
>   -inserts several thousand rows into 3 tables every 5 minutes. (the tables
>   contain around 10 million rows each)
>   -truncates and rebuilds aggregate tables of this data every 5 minutes.
>   (several thousand rows each)
>   -regular reads of aggregate table and sometimes large tables by user
>   interaction
>   -every night, hundreds of thousands of rows are deleted from these 3
>   tables (old data)
>   -20-30 other tables get inserted/updated slowly throughout the day
>
> In order to optimize performance of the inserts, I disabled
> autovacuum/row-level stats and instead run "vacuum analyze" on the whole
> DB every hour.  However this operation takes around 20 minutes of each
> hour.  This means that the database is involved in vacuum/analyzing
> tables 33% of the time.
>
> I'd like any performance advice, but my main concern is the amount of
> time vacuum/analyze runs and its possible impact on the overall DB
> performance.  Thanks!

If much of the data in the database isn't changing that often, then why
continually re-vacuum the whole thing?

I'd suggest trying autovacuum and see how it does (though you might want
to tune it to be more or less aggressive, and you'll probably want to
enable the cost delay).

The only cases where manual vacuum makes sense to me are when you've got
a defined slow period and vacuuming during that slow period is still
frequent enough to keep up with demand, or when you've got tables that
have a very high churn rate and need to be kept small. In the latter
case, I'll usually set up a cronjob to vacuum those tables once a minute
with no cost delay. I'm sure there are other cases where not using
autovac makes sense, but generally I'd much rather let autovac worry
about this so I don't have to.
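
For the high-churn case the cronjob is nothing fancy; something along
these lines (database and table names are placeholders, and on Windows
you'd use the task scheduler rather than cron):

  # crontab entry: vacuum the hot table once a minute, no cost delay
  * * * * *  vacuumdb --quiet --dbname=mydb --table=queue_table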

> I am running 8.2 (will be 8.2.1 soon).  The box is Windows with 2GB RAM
> connected to a SAN over fiber.  The data and pg_xlog are on separate
> partitions.
>
> Modified configuration:
> effective_cache_size = 1000MB
> random_page_cost = 3
> default_statistics_target = 50
> maintenance_work_mem = 256MB
> shared_buffers = 400MB
> temp_buffers = 10MB
> work_mem = 10MB
> max_fsm_pages = 1500000

One other useful manual vacuum to consider is running vacuumdb -av
periodically (say, once a month) and looking at the last few lines of
output. That will give you a good idea on how large you should set
max_fsm_pages. Running the output of vacuumdb -av through pgFouine will
give you other useful data.
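
Something like this does the trick (the free space map summary - page
slots needed vs. max_fsm_pages - shows up in the last few lines):

  vacuumdb -a -v > vacuum.log 2>&1
  tail -n 30 vacuum.log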

> checkpoint_segments = 30
> stats_row_level = off
> stats_start_collector = off

Unless you're really trying to get the last ounce of performance out,
it's probably not worth turning those stats settings off.
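
In fact, if you do switch to autovacuum you'll need to turn them back on
anyway, since in 8.2 autovacuum depends on the stats collector and
row-level stats; i.e. in postgresql.conf:

  stats_start_collector = on
  stats_row_level = on
  autovacuum = on
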
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: High inserts, bulk deletes - autovacuum vs scheduled vacuum

From: "Jim C. Nasby"
Please cc the list so others can help and learn.

On Wed, Jan 10, 2007 at 03:43:00PM -0500, Jeremy Haile wrote:
> > I'd suggest trying autovacuum and see how it does (though you might want
> > to tune it to be more or less aggressive, and you'll probably want to
> > enable the cost delay).
>
> What are some decent default values for the cost delay vacuum settings?
> I haven't used these before.

I find that simply setting vacuum_cost_delay to 20 is generally a good
starting point. I'll usually do that and then run a vacuum while
watching disk activity; I try and tune it so that the disk is ~90%
utilized with vacuum running. That allows a safety margin without
stretching vacuums out forever.
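
In postgresql.conf terms that's something like (if memory serves, -1 is
the default for the autovacuum variant and means "inherit the plain
vacuum setting"):

  vacuum_cost_delay = 20            # ms slept each time the cost limit is hit
  #vacuum_cost_limit = 200          # default; how much work between sleeps
  autovacuum_vacuum_cost_delay = -1 # -1 = use vacuum_cost_delay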

> Also - do the default autovacuum settings make sense for tables on the
> scale of 10 million rows?  For example, using the defaults it would
> require about a  million rows (250 + 0.1 * 10 million) to be
> inserted/updated/deleted before analyzing - which  seems high.  (about 2
> million for vacuum)  Or am I overestimating how often I would need to
> vacuum/analyze these tables?

Depends on your application... the way I look at it is that a setting of
0.1 means 10% dead space in the table. While 5% or 1% would be better,
you hit a point of diminishing returns, since you have to read the entire
table and its indexes to vacuum it.

BTW, those are the default values for analyze... the defaults for vacuum
are 2x that.
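
For a 10M-row table on the stock 8.2 settings that works out to:

  analyze: 250 + 0.1 * 10,000,000 = 1,000,250 rows changed
  vacuum:  500 + 0.2 * 10,000,000 = 2,000,500 dead rows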

> Do most people use the default autovacuum settings successfully, or are
> they usually modified?

I generally use the 8.2 defaults (which are much better than the 8.1
defaults) unless I'm really trying to tune things. What's more important
is to make sure critical tables (such as queue tables) are getting
vacuumed frequently so that they stay small. (Of course, you also need
to ensure there are no long-running transactions.)
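
A quick way to check for those is pg_stat_activity, e.g. (8.2 column
names; query_start is only a rough proxy, since there's no xact_start
column yet):

  SELECT procpid, usename, query_start, current_query
    FROM pg_stat_activity
   ORDER BY query_start;
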
--
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net

Re: High inserts, bulk deletes - autovacuum vs scheduled

From: "Jeremy Haile"
> BTW, those are the default values for analyze... the defaults for vacuum
> are 2x that.

Yeah - I was actually more concerned that tables would need to be
analyzed more often than I was about vacuuming too often, so I used
analyze as the example.  Since my app is inserting constantly throughout
the day and querying for "recent" data - I want to make sure the query
planner realizes that there are lots of rows with new timestamps on
them.  In other words, if I run a query "select * from mytable where
timestamp > '9:00am'" - I want to make sure it hasn't been a day since
the table was analyzed, or the planner will think there are zero rows
greater than 9:00am today.

> What's more important
> is to make sure critical tables (such as queue tables) are getting
> vacuumed frequently so that they stay small.

Is the best way to do that usually to lower the scale factors?  Is it
ever a good approach to lower the scale factor to zero and just set the
thresholds to a pure number of rows? (when setting it for a specific
table)
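
(If I'm reading the 8.2 docs right, per-table settings go into the
pg_autovacuum system catalog rather than postgresql.conf - something
like the row below, though I'm not certain I have the column names
right; -1 means "use the global default":)

  INSERT INTO pg_autovacuum
         (vacrelid, enabled,
          vac_base_thresh, vac_scale_factor,
          anl_base_thresh, anl_scale_factor,
          vac_cost_delay, vac_cost_limit,
          freeze_min_age, freeze_max_age)
  VALUES ('mytable'::regclass, true,
          100000, 0,      -- vacuum after a fixed 100k dead rows
          50000,  0,      -- analyze after a fixed 50k changed rows
          -1, -1,
          -1, -1);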

Thanks,
Jeremy Haile

Re: High inserts, bulk deletes - autovacuum vs scheduled vacuum

From: "Jim C. Nasby"
On Wed, Jan 10, 2007 at 04:48:42PM -0500, Jeremy Haile wrote:
> > BTW, those are the default values for analyze... the defaults for vacuum
> > are 2x that.
>
> Yeah - I was actually more concerned that tables would need to be
> analyzed more often than I was about vacuuming too often, so I used
> analyze as the example.  Since my app is inserting constantly throughout
> the day and querying for "recent" data - I want to make sure the query
> planner realizes that there are lots of rows with new timestamps on
> them.  In other words, if I run a query "select * from mytable where
> timestamp > '9:00am'" - I want to make sure it hasn't been a day since
> the table was analyzed, or the planner will think there are zero rows
> greater than 9:00am today.

Well, analyze is pretty cheap. At most it'll read only 30,000 pages,
which shouldn't take terribly long on a decent system. So you can be a
lot more aggressive with it.
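
(E.g. it wouldn't hurt to analyze just the hot tables from a scheduled
job every few minutes; a single-table ANALYZE only reads a sample of the
table:)

  ANALYZE mytable;   -- "mytable" as in your example; repeat for the other big tables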

> > What's more important
> > is to make sure critical tables (such as queue tables) are getting
> > vacuumed frequently so that they stay small.
>
> Is the best way to do that usually to lower the scale factors?  Is it
> ever a good approach to lower the scale factor to zero and just set the
> thresholds to a pure number of rows? (when setting it for a specific
> table)

The problem is what happens if autovac goes off and starts vacuuming
some large table? While that's going on your queue table is sitting
there bloating. If you have a separate cronjob to handle the queue
table, it'll stay small, especially in 8.2.
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: High inserts, bulk deletes - autovacuum vs scheduled vacuum

From: Alvaro Herrera
Jim C. Nasby wrote:

> > Is the best way to do that usually to lower the scale factors?  Is it
> > ever a good approach to lower the scale factor to zero and just set the
> > thresholds to a pure number of rows? (when setting it for a specific
> > table)
>
> The problem is what happens if autovac goes off and starts vacuuming
> some large table? While that's going on your queue table is sitting
> there bloating. If you have a separate cronjob to handle the queue
> table, it'll stay small, especially in 8.2.

You mean "at least in 8.2".  In previous releases, you could vacuum
that queue table until you were blue on the face, but it would achieve
nothing because it would consider that the dead tuples were visible to a
running transaction: that running the vacuum on the large table.  This
is an annoyance that was fixed in 8.2.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: High inserts, bulk deletes - autovacuum vs scheduled vacuum

From: "Jim C. Nasby"
On Thu, Jan 11, 2007 at 12:10:34AM -0300, Alvaro Herrera wrote:
> Jim C. Nasby wrote:
>
> > > Is the best way to do that usually to lower the scale factors?  Is it
> > > ever a good approach to lower the scale factor to zero and just set the
> > > thresholds to a pure number of rows? (when setting it for a specific
> > > table)
> >
> > The problem is what happens if autovac goes off and starts vacuuming
> > some large table? While that's going on your queue table is sitting
> > there bloating. If you have a separate cronjob to handle the queue
> > table, it'll stay small, especially in 8.2.
>
> You mean "at least in 8.2".  In previous releases, you could vacuum
> that queue table until you were blue in the face, but it would achieve
> nothing, because the dead tuples were still considered visible to a
> running transaction: the one running the vacuum on the large table.
> This is an annoyance that was fixed in 8.2.

True, but in many environments there are other transactions that run
long enough that additional vacuums would still help even while a long
vacuum is running.
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)