Thread: Vacuum thoughts

Vacuum thoughts

From
Greg Stark
Date:
The more I think about this vacuum i/o problem, the more I think we have it
wrong. The added i/o from vacuum really ought not be any worse than a single
full table scan. And there is probably the occasional query doing full table
scans already in those systems.

For the folks having this issue, if you run "select count(*) from bigtable" is
there as big a hit in transaction performance? On the other hand, does the
vacuum performance hit kick in right away? Or only after it's been running for
a bit?

I think the other factor mentioned is actually the main problem: cache. The
vacuum basically kills the kernel buffer cache by reading in every block of
every table in the system. The difference between vacuum and a single "select
count(*)" is that it does all the tables one after each other eventually
overrunning the total cache available.

If it's just a matter of all the read i/o from vacuum then we're best off
sleeping for a few milliseconds every few kilobytes. If it's the cache then
we're probably better off reading a few megabytes and then sleeping for
several seconds to allow the other buffers to get touched and pushed back to
the front of the LRU.

Hm, I wonder if the amount of data to read between sleeps should be something
like 25% of effective_cache_size, for example.
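
As a very rough sketch (this is not actual PostgreSQL code; read_one_block()
is just a stand-in for vacuum's real per-block work, and the constants are
made up), the two shapes would look something like:

    #include <stdbool.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    static bool read_one_block(void)
    {
        /* stand-in for reading and processing one heap block */
        return false;   /* pretend the table is empty so the sketch terminates */
    }

    /* shape (a): a short sleep every few kilobytes of reads */
    static void vacuum_sleep_per_few_kb(void)
    {
        long since_sleep = 0;

        while (read_one_block())
        {
            since_sleep += BLCKSZ;
            if (since_sleep >= 64 * 1024)       /* every ~64 kB */
            {
                usleep(10 * 1000);              /* sleep ~10 ms */
                since_sleep = 0;
            }
        }
    }

    /* shape (b): read a burst of a few megabytes, then sleep for seconds */
    static void vacuum_sleep_per_burst(long burst_bytes)
    {
        long since_sleep = 0;

        while (read_one_block())
        {
            since_sleep += BLCKSZ;
            if (since_sleep >= burst_bytes)     /* e.g. 25% of effective_cache_size */
            {
                sleep(5);                       /* several seconds */
                since_sleep = 0;
            }
        }
    }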

-- 
greg



Re: Vacuum thoughts

From
Neil Conway
Date:
On Fri, 2003-10-17 at 16:22, Greg Stark wrote:
> If it's just a matter of all the read i/o from vacuum then we're best off
> sleeping for a few milliseconds every few kilobytes. If it's the cache then
> we're probably better off reading a few megabytes and then sleeping for
> several seconds to allow the other buffers to get touched and pushed back to
> the front of the LRU.

Uh, no -- if it is the cache, we're better off fixing the buffer
replacement policy, not trying to hack around it. Replacement policies
that don't suffer from sequential flooding are well known.

-Neil




Re: Vacuum thoughts

From
Tom Lane
Date:
Neil Conway <neilc@samurai.com> writes:
> Uh, no -- if it is the cache, we're better off fixing the buffer
> replacement policy, not trying to hack around it.

If we can.  As long as we are largely depending on the kernel's buffer
cache, we may not be able to "just fix it" ...
        regards, tom lane


Re: Vacuum thoughts

From
Gaetano Mendola
Date:
Greg Stark wrote:
> The more I think about this vacuum i/o problem, the more I think we have it
> wrong. The added i/o from vacuum really ought not be any worse than a single
> full table scan. And there is probably the occasional query doing full table
> scans already in those systems.
> 
> For the folks having this issue, if you run "select count(*) from bigtable" is
> there as big a hit in transaction performance? On the other hand, does the
> vacuum performance hit kick in right away? Or only after it's been running for
> a bit?

Is the cost of vacuum the same as a full table scan ( select count(*) )?
Why not do a sort of "vacuum" whenever a table scan happens ( during a simple
select that involves a full table scan, for example )?


Regards
Gaetano Mendola




Re: Vacuum thoughts

From
Shridhar Daithankar
Date:
Gaetano Mendola wrote:
> Is the cost of vacuum the same as a full table scan ( select count(*) )?
> Why not do a sort of "vacuum" whenever a table scan happens ( during a simple
> select that involves a full table scan, for example )?

I was thinking about it. How about vacuuming a page when it is being pushed out 
of the postgresql buffer cache? It is in memory, so not much IO is involved.

Could it be an approach?
 Shridhar



Re: Vacuum thoughts

From
Tom Lane
Date:
Shridhar Daithankar <shridhar_daithankar@persistent.co.in> writes:
> I was thinking about it. How about vacuuming a page when it is being
> pushed out of the postgresql buffer cache? It is in memory, so not much IO
> is involved.

You keep ignoring the problem of removing index entries.  To vacuum an
individual page, you need to be willing to read in (and update) all
index pages that reference the tuples-to-be-deleted.  This is hardly
tenable when the reason for pushing the page out of buffer cache was so
that you could read in something else instead --- you don't have spare
buffer slots, and you don't want to do all that I/O (and the associated
WAL log entries) before you can read in the page you originally wanted.

The latter point is really the crux of the problem.  The point of having
the VACUUM process is to keep maintenance work out of the critical path
of foreground queries.  Anything that moves even part of that
maintenance work into the critical path is going to be a net loss.
        regards, tom lane


Re: Vacuum thoughts

From
Shridhar Daithankar
Date:
Tom Lane wrote:

> Shridhar Daithankar <shridhar_daithankar@persistent.co.in> writes:
> 
>>I was thinking about it. How about vacuuming a page when it is being
>>pushed out of the postgresql buffer cache? It is in memory, so not much IO
>>is involved.
> 
> 
> You keep ignoring the problem of removing index entries.  To vacuum an
> individual page, you need to be willing to read in (and update) all
> index pages that reference the tuples-to-be-deleted.  This is hardly
> tenable when the reason for pushing the page out of buffer cache was so
> that you could read in something else instead --- you don't have spare
> buffer slots, and you don't want to do all that I/O (and the associated
> WAL log entries) before you can read in the page you originally wanted.
> 
> The latter point is really the crux of the problem.  The point of having
> the VACUUM process is to keep maintenance work out of the critical path
> of foreground queries.  Anything that moves even part of that
> maintenance work into the critical path is going to be a net loss.

So the problem is that an index tuple does not store transaction id information 
like a heap tuple does, and we cannot deduce that an index tuple is dead unless 
it points to a dead heap tuple.

Is that right?

If an index tuple had the transaction information duplicated from the heap tuple, 
could the two types of tuples be removed independently of each other? Would the 
above scheme of vacuum-on-page-expiry work in that case?
 Shridhar



Re: Vacuum thoughts

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> You keep ignoring the problem of removing index entries.  To vacuum an
> individual page, you need to be willing to read in (and update) all
> index pages that reference the tuples-to-be-deleted.  

Hm. If the visibility information were stored in the index then that could be
postponed and done when the index page was read or when it was flushed.

So when you do a sequential scan, it marks any tuples it finds dead as free
in the free space map. And when you do an index scan, it marks any index
entries it finds dead as free in the free space map.

That wouldn't eliminate the need for vacuum though. It would be pretty easy to
delete chunks of tuples and never look at those index values again. Especially
since deleting them might not even bring the index pages in at all.

That said, I guess if you never look at the pages again, the wasted space in
them isn't having any performance penalty anyways.

> The latter point is really the crux of the problem.  The point of having
> the VACUUM process is to keep maintenance work out of the critical path
> of foreground queries.  Anything that moves even part of that
> maintenance work into the critical path is going to be a net loss.

I could envision a system where the process is guaranteed a certain number of
empty pages. Whenever it grabs one, the last page on the LRU is queued to be
cleaned. It doesn't have to actually get cleaned and flushed in the critical
path unless the transaction is starved of ready-to-use pages. If it is, it's
probably a batch transaction anyway and should switch to reusing the same
pages over and over to avoid flushing everything from the shared buffers.

-- 
greg



Re: Vacuum thoughts

From
Tom Lane
Date:
Shridhar Daithankar <shridhar_daithankar@persistent.co.in> writes:
> If an index tuple had the transaction information duplicated from the
> heap tuple, could the two types of tuples be removed independently of
> each other?

Possibly ... but I think we have already considered and rejected that
proposal, more than once.
        regards, tom lane


Re: Vacuum thoughts

From
Greg Stark
Date:
Greg Stark <gsstark@MIT.EDU> writes:

> Tom Lane <tgl@sss.pgh.pa.us> writes:
> 
> > You keep ignoring the problem of removing index entries.  To vacuum an
> > individual page, you need to be willing to read in (and update) all
> > index pages that reference the tuples-to-be-deleted.  
> 
> Hm. If the visibility information were stored in the index then that could be
> postponed and done when the index page was read or when it was flushed.

Or possibly if we could reuse the space in the table without updating the
index and be able to recognize that the tuple no longer corresponded to the
index entry when we next saw the index entry.

I suspect that would take as much space to keep track of as the visibility
information though.

-- 
greg



Re: Vacuum thoughts

From
Christopher Browne
Date:
The world rejoiced as tgl@sss.pgh.pa.us (Tom Lane) wrote:
> The latter point is really the crux of the problem.  The point of having
> the VACUUM process is to keep maintenance work out of the critical path
> of foreground queries.  Anything that moves even part of that
> maintenance work into the critical path is going to be a net loss.

How about a "flip side" for this...

VACUUM CACHE;

This new operation vacuums only those pages of relations that are in
cache.  Tuples that have recently been trashed are quite likely to be
in cache because they had to be drawn in in order to mark them as
obsolete.  So if there is (say) a "balance table" where entries get
updated regularly, the dead tuples should be sitting in cache.

This should go "blip - walk through 10K pages of cache in memory" and
be done.  (Modulo a need to pull in some index pages to mark index
entries as dead...)
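
As a hypothetical sketch only (not a real backend function; NBuffers,
buffer_holds_heap_page() and prune_dead_tuples() are placeholders for the
shared buffer pool size and for the work the real code would have to do,
including the index-entry problem noted above):

    static int NBuffers = 1000;                 /* size of the shared buffer pool */

    static int buffer_holds_heap_page(int buf)  /* placeholder predicate */
    {
        (void) buf;
        return 0;
    }

    static void prune_dead_tuples(int buf)      /* placeholder: remove dead rows */
    {
        (void) buf;
    }

    static void vacuum_cache(void)
    {
        int i;

        for (i = 0; i < NBuffers; i++)
        {
            if (!buffer_holds_heap_page(i))     /* skip pages not resident in cache */
                continue;
            prune_dead_tuples(i);               /* in-memory work, no read I/O */
        }
    }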
-- 
output = reverse("gro.mca" "@" "enworbbc")
http://www.ntlug.org/~cbbrowne/linux.html
Build a man a fire and he will be warm for a day.
Set a man on fire and he will be warm for the rest of his life.


Re: Vacuum thoughts

From
Tom Lane
Date:
Christopher Browne <cbbrowne@acm.org> writes:
> How about a "flip side" for this...
> VACUUM CACHE;

> This new operation vacuums only those pages of relations that are in
> cache.

This might conceivably be a useful behavior (modulo the problem of
fixing index entries) ... but I think we'd not want to expose it as
an actual VACUUM option.  If we were going to do something like that,
it should likely be done by an invisible automatic vacuum daemon,
more like checkpoints ...
        regards, tom lane


Re: Vacuum thoughts

From
Jan Wieck
Date:
To add some medium-hard data to the discussion, I hacked a PG 7.3.4 a 
little. The system I am talking about below runs an artificial 
application that closely resembles the behaviour of a TPC-C benchmark 
implementation. Without vacuuming the database, it can just barely 
sustain a scale factor 5 database running with 50 simulated terminals. 
To free some bandwidth, the system is configured with scaling 4 and runs 
with 40 simulated terminals. In this configuration it can satisfy the 
response-time requirements for 100% of all transactions when not 
vacuuming ... no surprise.

The test driver takes 10 minute intervals and reports the percentage of 
transactions which qualify.

If the database now is vacuumed simultaneously, the response time for 
transactions changes dramatically. A 10 minute interval hit by vacuum 
drops down from 100% to anything below 90%, I've seen it down to 75%. 
The system load given by a Linux 2.4 kernel jumps up from under 1.0 to 
anything between 5 and 8.

So far, that is exactly what most DBAs are complaining about. A system 
that otherwise runs smoothly gets literally bogged down by any vacuum.

Now I changed the cache policy. While a backend is running vacuum, a 
global flag is set. If this flag is set and a block is not found in the 
cache but must be read, its buffer is marked BM_READ_BY_VACUUM. When 
the global flag is set, AddBufferToFreelist() inserts buffers so marked 
at the head of the freelist instead of adding them to the tail. In any 
case, the buffer's BM_READ_BY_VACUUM flag is cleared.
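
In pseudo-C, the idea looks roughly like this (paraphrased sketch, not the 
actual patch; the freelist here is simplified to a singly-linked list with 
head and tail pointers):

    typedef struct BufferDesc
    {
        unsigned int        flags;
        struct BufferDesc  *freeNext;
    } BufferDesc;

    #define BM_READ_BY_VACUUM  0x0100        /* flag bit value chosen arbitrarily here */

    static int         VacuumRunning = 0;    /* the global flag */
    static BufferDesc *FreelistHead = NULL;
    static BufferDesc *FreelistTail = NULL;

    static void AddBufferToFreelist(BufferDesc *buf)
    {
        if (VacuumRunning && (buf->flags & BM_READ_BY_VACUUM))
        {
            /* blocks vacuum had to read in go to the head of the freelist,
             * so they are reused first and vacuum keeps cycling through a
             * handful of buffers instead of flushing the whole cache */
            buf->freeNext = FreelistHead;
            FreelistHead = buf;
            if (FreelistTail == NULL)
                FreelistTail = buf;
        }
        else
        {
            /* everything else is appended at the tail, as before */
            buf->freeNext = NULL;
            if (FreelistTail != NULL)
                FreelistTail->freeNext = buf;
            else
                FreelistHead = buf;
            FreelistTail = buf;
        }

        /* in any case, the marker is cleared */
        buf->flags &= ~BM_READ_BY_VACUUM;
    }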

The effect of this simple hack is somewhat surprising. Not only does the 
system keep satisfying 97% or more of all transactions within the time 
limits while the system load stays well below 2.0 (I've only seen 1.6 
once), but very surprisingly VACUUM also finishes about 20% faster.

I'm not a fan of jumping to conclusions; OTOH, I have to try to make 
some sense out of it. So I would like the following to be taken with a 
reasonable grain of salt.

I think that the common theory, that vacuum is similar to a sequential scan, 
just does not hold true for any table that is actually updated randomly.
What happens instead is that vacuum not only evicts the whole buffer 
cache by forcing all blocks of said table and its indexes in, it also 
dirties a substantial amount of that and leaves the dirt to be cleaned 
up by all the other backends.

The changes I've done above cause vacuum to work with as few shared 
buffers as possible for the data not already found in the cache. This 
avoids imposing unnecessary additional write overhead for regular 
backends, and causes the vacuum process to stay inside of a few virtual 
memory pages instead of running all over the place. I don't know how 
much the latter impacts the efficiency of the MMU, it might not be 
significant here.

It is quite possible that there is some other side effect in the buffer 
cache that impacts the behaviour of many backends each doing a few writes 
compared to one backend doing them in bulk.

However, the test indicates that there is some low hanging fruit in the 
cache algorithm, and that it's not just a few little raspberries.


Jan


Tom Lane wrote:

> Shridhar Daithankar <shridhar_daithankar@persistent.co.in> writes:
>> I was thinking about it. How about vacuuming a page when it is being
>> pushed out of the postgresql buffer cache? It is in memory, so not much IO
>> is involved.
> 
> You keep ignoring the problem of removing index entries.  To vacuum an
> individual page, you need to be willing to read in (and update) all
> index pages that reference the tuples-to-be-deleted.  This is hardly
> tenable when the reason for pushing the page out of buffer cache was so
> that you could read in something else instead --- you don't have spare
> buffer slots, and you don't want to do all that I/O (and the associated
> WAL log entries) before you can read in the page you originally wanted.
> 
> The latter point is really the crux of the problem.  The point of having
> the VACUUM process is to keep maintenance work out of the critical path
> of foreground queries.  Anything that moves even part of that
> maintenance work into the critical path is going to be a net loss.
> 
>             regards, tom lane
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster


-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #



Re: Vacuum thoughts

From
Tom Lane
Date:
Jan Wieck <JanWieck@Yahoo.com> writes:
> What happens instead is that vacuum not only evicts the whole buffer 
> cache by forcing all blocks of said table and its indexes in, it also 
> dirties a substantial amount of that and leaves the dirt to be cleaned 
> up by all the other backends.

[ thinks about that... ]  Yeah, I believe you're right, because (plain)
vacuum just does WriteBuffer() for any page that it modifies, which only
marks the page dirty in buffer cache.  It never does anything to force
those pages to be written out to the kernel.  So, if you have a large
buffer cache, a lot of write work will be left over to be picked up by
other backends.

I think that pre-WAL the system used to handle this stuff differently,
in a way that made it more likely that VACUUM would issue its own
writes.  But optimizations intended to improve the behavior for
non-VACUUM cases have made this not so good for VACUUM.

I like your idea of penalizing VACUUM-read blocks when they go back into
the freelist.  This seems only a partial solution though, since it
doesn't directly ensure that VACUUM rather than some other process will
issue the write kernel call for the dirtied page.  Maybe we should
resurrect a version of WriteBuffer() that forces an immediate kernel
write, and use that in VACUUM.
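
A minimal sketch of that shape (hypothetical, not actual bufmgr code;
flush_wal_up_to() and write_page_to_kernel() are placeholders, and all the
locking and error handling is omitted) might be:

    typedef struct PageBuffer
    {
        int             is_dirty;
        unsigned long   page_lsn;       /* WAL position of the last change to the page */
        char            data[8192];     /* BLCKSZ */
    } PageBuffer;

    static void flush_wal_up_to(unsigned long lsn) { (void) lsn; }     /* placeholder */
    static void write_page_to_kernel(PageBuffer *buf) { (void) buf; }  /* placeholder */

    static void WriteBufferImmediate(PageBuffer *buf)
    {
        buf->is_dirty = 1;               /* same bookkeeping as plain WriteBuffer() */

        flush_wal_up_to(buf->page_lsn);  /* WAL must reach disk before the data page */
        write_page_to_kernel(buf);       /* hand the page to the kernel right now */

        buf->is_dirty = 0;               /* no other backend has to write it later */
    }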

Also, we probably need something similar for seqscan-read blocks, but
with an intermediate priority (can we insert them into the middle of the
freelist?)
        regards, tom lane


Re: Vacuum thoughts

From
Jan Wieck
Date:
Tom Lane wrote:

> Jan Wieck <JanWieck@Yahoo.com> writes:
>> What happens instead is that vacuum not only evicts the whole buffer 
>> cache by forcing all blocks of said table and its indexes in, it also 
>> dirties a substantial amount of that and leaves the dirt to be cleaned 
>> up by all the other backends.
> 
> [ thinks about that... ]  Yeah, I believe you're right, because (plain)
> vacuum just does WriteBuffer() for any page that it modifies, which only
> marks the page dirty in buffer cache.  It never does anything to force
> those pages to be written out to the kernel.  So, if you have a large
> buffer cache, a lot of write work will be left over to be picked up by
> other backends.
> 
> I think that pre-WAL the system used to handle this stuff differently,
> in a way that made it more likely that VACUUM would issue its own
> writes.  But optimizations intended to improve the behavior for
> non-VACUUM cases have made this not so good for VACUUM.
> 
> I like your idea of penalizing VACUUM-read blocks when they go back into
> the freelist.  This seems only a partial solution though, since it
> doesn't directly ensure that VACUUM rather than some other process will
> issue the write kernel call for the dirtied page.  Maybe we should
> resurrect a version of WriteBuffer() that forces an immediate kernel
> write, and use that in VACUUM.
> 
> Also, we probably need something similar for seqscan-read blocks, but
> with an intermediate priority (can we insert them to the middle of the
> freelist?)

Well, "partial solution" isn't quite what I would call it, and it surely 
needs integration with sequential scans. I really do expect the whole 
hack to fall apart if some concurrent seqscans are going on since it not 
really penalizes the VACUUM-read blocks but more the next caller of 
GetFreeBuffer(). In my test case that just happens to be VACUUM most of 
the time. I described it only to demonstrate the existence of potential.

Since the whole point of the buffer cache is to avoid the really bad 
thing, I/O, I don't think that the trivial doubly-linked list that 
implements it today is adequate.

I can't imagine it completely yet, but what I see vaguely is a cache 
policy that puts a block into the freelist depending on where it came 
from (cache, seqscan, indexscan, vacuum) and what it is (heap, 
toast, index). That plus the possibility for vacuum to cause it to be 
written to the kernel immediately might do it.


Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #



Re: Vacuum thoughts

From
Neil Conway
Date:
On Mon, 2003-10-27 at 15:31, Jan Wieck wrote:
> Well, "partial solution" isn't quite what I would call it, and it surely 
> needs integration with sequential scans. I really do expect the whole 
> hack to fall apart if some concurrent seqscans are going on

I'd rather see us implement a buffer replacement policy that considers
both frequency + recency (unlike LRU, which considers only recency).
Ideally, that would work "automagically". I'm hoping to get a chance to
implement ARC[1] during the 7.5 cycle.
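
For illustration only, a grossly simplified two-list sketch of the
frequency-plus-recency idea (this is not ARC itself, which also keeps
"ghost" lists of recently evicted pages and adapts the split between the
two lists; array shifting stands in for real list handling):

    #include <string.h>

    #define CACHE_SIZE 8

    static int T1[CACHE_SIZE];      /* pages touched once recently */
    static int T2[CACHE_SIZE];      /* pages touched at least twice */
    static int n1 = 0, n2 = 0;      /* index 0 is the LRU end of each list */

    static int find(const int *list, int n, int page)
    {
        int i;
        for (i = 0; i < n; i++)
            if (list[i] == page)
                return i;
        return -1;
    }

    static void remove_at(int *list, int *n, int i)
    {
        memmove(&list[i], &list[i + 1], (size_t) (*n - i - 1) * sizeof(int));
        (*n)--;
    }

    static void push_mru(int *list, int *n, int page)
    {
        list[(*n)++] = page;
    }

    static void access_page(int page)
    {
        int i;

        if ((i = find(T2, n2, page)) >= 0)      /* hit in the "frequent" list */
        {
            remove_at(T2, &n2, i);
            push_mru(T2, &n2, page);
            return;
        }
        if ((i = find(T1, n1, page)) >= 0)      /* second touch: promote to T2 */
        {
            remove_at(T1, &n1, i);
            push_mru(T2, &n2, page);
            return;
        }

        /* miss: make room, preferring to evict a once-touched page, so a
         * big sequential scan can only flood T1 and never pushes the
         * frequently used pages out of T2 */
        if (n1 + n2 >= CACHE_SIZE)
        {
            if (n1 > 0)
                remove_at(T1, &n1, 0);
            else
                remove_at(T2, &n2, 0);
        }
        push_mru(T1, &n1, page);                /* newcomers start in T1 */
    }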

-Neil

[1]: http://citeseer.nj.nec.com/megiddo03arc.html




Re: Vacuum thoughts

From
Bruce Momjian
Date:
Neil Conway wrote:
> On Mon, 2003-10-27 at 15:31, Jan Wieck wrote:
> > Well, "partial solution" isn't quite what I would call it, and it surely 
> > needs integration with sequential scans. I really do expect the whole 
> > hack to fall apart if some concurrent seqscans are going on
> 
> I'd rather see us implement a buffer replacement policy that considers
> both frequency + recency (unlike LRU, which considers only recency).
> Ideally, that would work "automagically". I'm hoping to get a chance to
> implement ARC[1] during the 7.5 cycle.

Someone just started working on it this week.  He emailed Jan and me.  He
hopes to have a patch in a few days.  I will make sure he posts to
hackers/patches.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: Vacuum thoughts

From
"Cuong Bui (lokishop.nl)"
Date:
>I'd rather see us implement a buffer replacement policy that considers
>both frequency + recency (unlike LRU, which considers only recency).
>Ideally, that would work "automagically". I'm hoping to get a chance to
>implement ARC[1] during the 7.5 cycle.

Actually I've already done some work back in August. I tried to
implement ARC in pgsql but failed due to lack of time. After
renewed discussion with Bruce and Jan, I've decided to give it another
try. So hopefully we'll have a patch soon from me or Jan.

The algorithm itself is very elegant and works "automagically".

-- 
Cuong Bui
Loki IT Solutions
http://www.lokishop.nl
Powers your webcommerce!