Re: Plans for solving the VACUUM problem - Mailing list pgsql-hackers

From Philip Warner
Subject Re: Plans for solving the VACUUM problem
Date
Msg-id 3.0.5.32.20010518155226.02edf2e0@mail.rhyme.com.au
In response to Plans for solving the VACUUM problem  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Plans for solving the VACUUM problem  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
At 19:05 17/05/01 -0400, Tom Lane wrote:
>
>But having said that, there's no reason to remove the existing VACUUM
>code: we can keep it around for situations where you need to crunch a
>table as much as possible and you can afford to lock the table while
>you do it.  

It would be great if this were the *only* reason to use the old-style
VACUUM, i.e. we should try to avoid a solution that has a VACUUM LAZY
running in the background plus a recommendation to run a 'VACUUM PROPERLY'
once in a while.


>The new code would be a new command, maybe "VACUUM LAZY"
>(or some other name entirely).

Maybe a name that reflects its strength/purpose: 'VACUUM
ONLINE/BACKGROUND/NOLOCKS/CONCURRENT' etc.


>Enough handwaving, what about specifics?
>
>1. Forget moving tuples from one page to another.  Doing that in a
>transaction-safe way is hugely expensive and complicated.  Lazy VACUUM
>will only delete dead tuples and coalesce the free space thus made
>available within each page of a relation.

Could this be done opportunistically, i.e. build up a list of candidate
pages to move tuples from (perhaps based on how empty each page is), then
move a subset of those in each pass? It's only really useful for a table
that has a high update load and then becomes static, which is not as
unusual as it sounds: people do archive tables by renaming them and then
creating a new, lean 'current' table. With the new vacuum, the static table
may end up with many half-empty pages that are never reused.
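
Very roughly, I'm imagining something like this (all names and numbers
invented, just to illustrate the 'opportunistic' part):

#include <stddef.h>
#include <stdint.h>

/* During the read-only scan, remember the emptiest pages as candidates,
 * and move tuples off at most a fixed number of them per pass. */
typedef struct MoveCandidate
{
    uint32_t blkno;        /* page number within the relation */
    size_t   freeBytes;    /* free space seen during the scan */
} MoveCandidate;

#define MAX_CANDIDATES 64  /* per-pass cap, number picked out of the air */

static MoveCandidate candidates[MAX_CANDIDATES];
static int nCandidates = 0;

/* called once per page during the scan */
static void
remember_candidate(uint32_t blkno, size_t freeBytes, size_t threshold)
{
    if (freeBytes < threshold)
        return;                         /* not empty enough to bother */
    if (nCandidates < MAX_CANDIDATES)
    {
        candidates[nCandidates].blkno = blkno;
        candidates[nCandidates].freeBytes = freeBytes;
        nCandidates++;
    }
    /* otherwise just drop it: it's opportunistic, and the next pass
     * can pick the page up again */
}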


>2. This does no good unless there's a provision to re-use that free space.
>To do that, I propose a free space map (FSM) kept in shared memory, which
>will tell backends which pages of a relation have free space.  Only if the
>FSM shows no free space available will the relation be extended to insert
>a new or updated tuple.

I assume that now is not a good time to bring up memory-mapped files? ;-}


>3. Lazy VACUUM processes a table in five stages:
>   A. Scan relation looking for dead tuples; accumulate a list of their
>      TIDs, as well as info about existing free space.  (This pass is
>      completely read-only and so incurs no WAL traffic.)

Were you planning on just a free byte count, or something smaller? Dec/RDB
uses a nasty system of DBA-defined thresholds for each storage area: 4 bits,
where 0=empty, and 1, 2 & 3 indicate above/below thresholds (3 is also
considered 'full'). The thresholds are usually set based on average record
sizes. In this day & age, I suspect a 1-byte percentage or a 2-byte count is
OK unless space is really at a premium.
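
To put some numbers on that, roughly (illustration only; the constants and
names are made up):

#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 8192          /* assuming the usual 8K page */

/* Dec/RDB-style: a small category code against DBA-set thresholds.
 * Cheap to store, but tells you nothing precise about the page. */
static uint8_t
free_space_category(size_t freeBytes, size_t t1, size_t t2, size_t t3)
{
    if (freeBytes >= t1) return 0;      /* "empty" */
    if (freeBytes >= t2) return 1;
    if (freeBytes >= t3) return 2;
    return 3;                           /* "full" */
}

/* One-byte percentage: roughly 80-byte granularity on an 8K page. */
static uint8_t
free_space_percent(size_t freeBytes)
{
    return (uint8_t) ((freeBytes * 100) / PAGE_SIZE);
}

/* Two-byte count: exact to the byte for pages up to 64K. */
static uint16_t
free_space_count(size_t freeBytes)
{
    return (uint16_t) freeBytes;
}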


>5. Also note that there's nothing saying that lazy VACUUM must do the
>entire table in one go; once it's accumulated a big enough batch of dead
>tuples, it can proceed through steps B,C,D,E even though it's not scanned
>the whole table.  This avoids a rather nasty problem that VACUUM has
>always had with running out of memory on huge tables.

This sounds great, especially if the same approach could be adopted when/if
moving records.
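
For what it's worth, the kind of thing I'm picturing for the bounded
dead-tuple list (sketch only, names invented; flush_dead_tids stands in for
running steps B..E on the batch):

#include <stddef.h>
#include <stdint.h>

typedef struct DeadTid
{
    uint32_t blkno;      /* heap block */
    uint16_t offnum;     /* line pointer within the block */
} DeadTid;

#define MAX_DEAD_TIDS 100000        /* made-up cap, under a megabyte */

static DeadTid deadTids[MAX_DEAD_TIDS];
static size_t  nDeadTids = 0;

/* hypothetical: runs steps B..E on the accumulated batch */
extern void flush_dead_tids(DeadTid *tids, size_t n);

/* called for each dead tuple found in step A */
static void
note_dead_tuple(uint32_t blkno, uint16_t offnum)
{
    deadTids[nDeadTids].blkno = blkno;
    deadTids[nDeadTids].offnum = offnum;
    if (++nDeadTids == MAX_DEAD_TIDS)
    {
        flush_dead_tids(deadTids, nDeadTids);
        nDeadTids = 0;              /* resume the scan where it left off */
    }
}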


>Critical point: the FSM is only a hint and does not have to be perfectly
>accurate.  It can omit space that's actually available without harm, and
>if it claims there's more space available on a page than there actually
>is, we haven't lost much except a wasted ReadBuffer cycle.

So long as you store the # of bytes (or %), that should be fine. One of the
horrors of the Dec/RDB system is that with badly set thresholds you can
cycle through many pages looking for one that *really* has enough free space.

Also, would the detecting process fix the bad entry?
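
What I'm hoping is that the backend which notices the discrepancy just
writes the truth back, so the wasted ReadBuffer only happens once per stale
entry. Something like (sketch only, invented names; actual_free_space stands
in for looking at the page header):

#include <stddef.h>
#include <stdint.h>

typedef struct FsmEntry
{
    uint32_t blkno;
    uint16_t freeBytes;      /* advertised free space, a hint only */
} FsmEntry;

/* would examine the page header; assumed to exist for the sketch */
extern size_t actual_free_space(uint32_t blkno);

static int
page_still_usable(FsmEntry *entry, size_t needed)
{
    size_t actual = actual_free_space(entry->blkno);

    if (actual != entry->freeBytes)
        entry->freeBytes = (uint16_t) actual;   /* correct the stale hint */

    return actual >= needed;
}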


>  This allows
>us to take shortcuts in maintaining it.  In particular, we can constrain
>the FSM to a prespecified size, which is critical for keeping it in shared
>memory.  We just discard entries (pages or whole relations) as necessary
>to keep it under budget.

Presumably keeping the 'most empty' pages?
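
i.e. when the map is over budget, throw away the entry with the least free
space. Something like this, reusing the FsmEntry struct from the earlier
sketch (again, all names invented):

#define FSM_BUDGET 256     /* arbitrary per-relation cap for the sketch */

typedef struct RelFsm
{
    FsmEntry entries[FSM_BUDGET];
    int      nEntries;
} RelFsm;

static void
fsm_record_page(RelFsm *fsm, uint32_t blkno, uint16_t freeBytes)
{
    int i, victim;

    if (fsm->nEntries < FSM_BUDGET)
    {
        fsm->entries[fsm->nEntries].blkno = blkno;
        fsm->entries[fsm->nEntries].freeBytes = freeBytes;
        fsm->nEntries++;
        return;
    }

    /* full: find the least-empty entry, replace it if ours is better */
    victim = 0;
    for (i = 1; i < FSM_BUDGET; i++)
        if (fsm->entries[i].freeBytes < fsm->entries[victim].freeBytes)
            victim = i;

    if (freeBytes > fsm->entries[victim].freeBytes)
    {
        fsm->entries[victim].blkno = blkno;
        fsm->entries[victim].freeBytes = freeBytes;
    }
}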


>Obviously, we'd not bother to make entries in
>the first place for pages with only a little free space.  Relation entries
>might be discarded on a least-recently-used basis.

You also might want to record an average/min/max record size for each
table, to assess when a page's free space is too small to hold even a
minimum- or average-sized record.

While on the subject of record keeping, it would be great if it were coded
to collect statistics about its own operation for Jan's stats package.
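
Something along these lines, perhaps (sketch only, invented names):

#include <stddef.h>
#include <stdint.h>

/* Per-relation record-size hints kept next to the FSM, plus a couple of
 * counters the stats collector could pick up. */
typedef struct RelFsmStats
{
    uint16_t minTupleLen;      /* smallest tuple seen recently */
    uint16_t avgTupleLen;      /* running average */
    uint32_t pagesRecorded;    /* FSM entries created */
    uint32_t pagesRejected;    /* pages skipped as effectively full */
} RelFsmStats;

static int
page_worth_recording(RelFsmStats *stats, size_t freeBytes)
{
    if (freeBytes < stats->minTupleLen)
    {
        stats->pagesRejected++;    /* can't hold even the smallest tuple */
        return 0;
    }
    stats->pagesRecorded++;
    return 1;
}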


>Accesses to the FSM could create contention problems if we're not careful.
>I think this can be dealt with by having each backend remember (in its
>relcache entry for a table) the page number of the last page it chose from
>the FSM to insert into.  That backend will keep inserting new tuples into
>that same page, without touching the FSM, as long as there's room there.
>Only then does it go back to the FSM, update or remove that page entry,
>and choose another page to start inserting on.  This reduces the access
>load on the FSM from once per tuple to once per page.  

This seems to have the potential to create as many false FSM page entries
as there are backends. Is it really that expensive to lock the FSM table
entry, subtract a number, then unlock it, especially when compared to the
cost of adding/updating a whole record?
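
To show the cost I have in mind (sketch only; a pthread mutex stands in for
whatever lightweight lock the FSM would actually use, and all names are
invented):

#include <pthread.h>
#include <stdint.h>

typedef struct FsmPageEntry
{
    uint32_t blkno;
    uint16_t freeBytes;
} FsmPageEntry;

static pthread_mutex_t fsm_lock = PTHREAD_MUTEX_INITIALIZER;

/* one short critical section per insert: lock, subtract, unlock */
static void
fsm_note_insert(FsmPageEntry *entry, uint16_t tupleLen)
{
    pthread_mutex_lock(&fsm_lock);
    if (entry->freeBytes >= tupleLen)
        entry->freeBytes -= tupleLen;
    else
        entry->freeBytes = 0;
    pthread_mutex_unlock(&fsm_lock);
}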


>(Moreover, we can
>arrange that successive backends consulting the FSM pick different pages
>if possible.  Then, concurrent inserts will tend to go to different pages,
>reducing contention for shared buffers; yet any single backend does
>sequential inserts in one page, so that a bulk load doesn't cause
>disk traffic scattered all over the table.)

This will also increase the natural clustering of the database for SERIAL
fields even though the overall ordering will be all over the place (at
least for insert-intensive apps).


>
>Locking issues
>--------------
>
>We will need two extensions to the lock manager:
>
>1. A new lock type that allows concurrent reads and writes
>(AccessShareLock, RowShareLock, RowExclusiveLock) but not anything else.
>Lazy VACUUM will grab this type of table lock to ensure the table schema
>doesn't change under it.  Call it a VacuumLock until we think of a better
>name.

Is it possible/worthwhile to add a 'blocking notification' to the lock
manager? Then VACUUM could choose to terminate/restart when someone wants
to do a schema change. This is really only relevant if the VACUUM will be
prolonged.
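
The idea would be something like this (an entirely hypothetical hook; none
of these names exist):

#include <signal.h>

/* Lazy VACUUM registers a callback with the lock manager; if some backend
 * queues behind its VacuumLock wanting a schema change, the callback sets
 * a flag that VACUUM checks between pages. */
static volatile sig_atomic_t vacuum_should_yield = 0;

static void
on_lock_blocked(void *arg)      /* invented lock-manager callback */
{
    (void) arg;
    vacuum_should_yield = 1;
}

/* In the lazy VACUUM main loop, between pages:
 *   if (vacuum_should_yield)
 *       finish the current batch, release the lock, and reschedule.
 */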



----------------------------------------------------------------
Philip Warner
Albatross Consulting Pty. Ltd.
(A.B.N. 75 008 659 498)
Tel: (+61) 0500 83 82 81
Fax: (+61) 0500 83 82 82
Http://www.rhyme.com.au
PGP key available upon request, and from pgp5.ai.mit.edu:11371

