Plans for solving the VACUUM problem

From
Tom Lane
Date:
I have been thinking about the problem of VACUUM and how we might fix it
for 7.2.  Vadim has suggested that we should attack this by implementing
an overwriting storage manager and transaction UNDO, but I'm not totally
comfortable with that approach: it seems to me that it's an awfully large
change in the way Postgres works.  Instead, here is a sketch of an attack
that I think fits better into the existing system structure.

First point: I don't think we need to get rid of VACUUM, exactly.  What
we want for 24x7 operation is to be able to do whatever housekeeping we
need without locking out normal transaction processing for long intervals.
We could live with routine VACUUMs if they could run in parallel with
reads and writes of the table being vacuumed.  They don't even have to run
in parallel with schema updates of the target table (CREATE/DROP INDEX,
ALTER TABLE, etc).  Schema updates aren't things you do lightly for big
tables anyhow.  So what we want is more of a "background VACUUM" than a
"no VACUUM" solution.

Second: if VACUUM can run in the background, then there's no reason not
to run it fairly frequently.  In fact, it could become an automatically
scheduled activity like CHECKPOINT is now, or perhaps even a continuously
running daemon (which was the original conception of it at Berkeley, BTW).
This is important because it means that VACUUM doesn't have to be perfect.
The existing VACUUM code goes to huge lengths to ensure that it compacts
the table as much as possible.  We don't need that; if we miss some free
space this time around, but we can expect to get it the next time (or
eventually), we can be happy.  This leads to thinking of space management
in terms of steady-state behavior, rather than the periodic "big bang"
approach that VACUUM represents now.

But having said that, there's no reason to remove the existing VACUUM
code: we can keep it around for situations where you need to crunch a
table as much as possible and you can afford to lock the table while
you do it.  The new code would be a new command, maybe "VACUUM LAZY"
(or some other name entirely).

Enough handwaving, what about specifics?

1. Forget moving tuples from one page to another.  Doing that in a
transaction-safe way is hugely expensive and complicated.  Lazy VACUUM
will only delete dead tuples and coalesce the free space thus made
available within each page of a relation.

2. This does no good unless there's a provision to re-use that free space.
To do that, I propose a free space map (FSM) kept in shared memory, which
will tell backends which pages of a relation have free space.  Only if the
FSM shows no free space available will the relation be extended to insert
a new or updated tuple.

3. Lazy VACUUM processes a table in five stages:
   A. Scan relation looking for dead tuples; accumulate a list of their
      TIDs, as well as info about existing free space.  (This pass is
      completely read-only and so incurs no WAL traffic.)
   B. Remove index entries for the dead tuples.  (See below for details.)
   C. Physically delete dead tuples and compact free space on their pages.
   D. Truncate any completely-empty pages at relation's end.  (Optional,
      see below.)
   E. Create/update FSM entry for the table.
 
Note that this is crash-safe as long as the individual update operations
are atomic (which can be guaranteed by WAL entries for them).  If a tuple
is dead, we care not whether its index entries are still around or not;
so there's no risk to logical consistency.

4. Observe that lazy VACUUM need not really be a transaction at all, since
there's nothing it does that needs to be cancelled or undone if it is
aborted.  This means that its WAL entries do not have to hang around past
the next checkpoint, which solves the huge-WAL-space-usage problem that
people have noticed while VACUUMing large tables under 7.1.

5. Also note that there's nothing saying that lazy VACUUM must do the
entire table in one go; once it's accumulated a big enough batch of dead
tuples, it can proceed through steps B,C,D,E even though it hasn't scanned
the whole table.  This avoids a rather nasty problem that VACUUM has
always had with running out of memory on huge tables.
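
To make points 3 and 5 concrete, here is a rough C outline of the staged,
batched loop.  Every type and helper in it is an invented stub for
illustration, not actual or proposed Postgres code:

    typedef unsigned BlockNumber;
    typedef struct { BlockNumber blk; unsigned short off; } Tid;
    typedef struct { BlockNumber nblocks; } Relation;

    #define MAX_DEAD_TIDS 10000   /* fixed memory budget, not table size */

    /* Dummy stubs standing in for the five stages described above. */
    static int  scan_page_for_dead(Relation *r, BlockNumber b,
                                   Tid *out, int room) { return 0; } /* A */
    static void bulk_delete_indexes(Relation *r, Tid *d, int n) {}   /* B */
    static void delete_and_compact(Relation *r, Tid *d, int n) {}    /* C */
    static void try_truncate_end_pages(Relation *r) {}               /* D */
    static void update_fsm_entry(Relation *r) {}                     /* E */

    void lazy_vacuum_rel(Relation *rel)
    {
        Tid dead[MAX_DEAD_TIDS];
        int ndead = 0;

        for (BlockNumber blk = 0; blk < rel->nblocks; blk++) {
            /* Stage A: read-only scan, remembering dead tuples' TIDs */
            ndead += scan_page_for_dead(rel, blk, dead + ndead,
                                        MAX_DEAD_TIDS - ndead);
            if (ndead == MAX_DEAD_TIDS) {
                /* Batch full: run stages B and C now, then resume the
                 * scan.  Memory stays bounded on any size of table. */
                bulk_delete_indexes(rel, dead, ndead);
                delete_and_compact(rel, dead, ndead);
                ndead = 0;
            }
        }
        if (ndead > 0) {                     /* flush the final batch */
            bulk_delete_indexes(rel, dead, ndead);
            delete_and_compact(rel, dead, ndead);
        }
        try_truncate_end_pages(rel);   /* stage D: only if lock granted */
        update_fsm_entry(rel);         /* stage E */
    }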


Free space map details
----------------------

I envision the FSM as a shared hash table keyed by table ID, with each
entry containing a list of page numbers and free space in each such page.

The FSM is empty at system startup and is filled by lazy VACUUM as it
processes each table.  Backends then decrement/remove page entries as they
use free space.
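
As a rough sketch of the shape such a map might take (all declarations
here are invented for the example, not a committed layout):

    typedef unsigned RelFileId;
    typedef unsigned BlockNumber;

    /* One page believed to have useful free space.  free_bytes is only
     * a hint: backends decrement it as they consume space and drop the
     * slot once the page fills up. */
    typedef struct {
        BlockNumber page;
        int         free_bytes;
    } FsmPageSlot;

    /* Per-table entry in a shared hash table keyed by relid.  The whole
     * map lives in a fixed-size shared-memory arena, so the page list
     * is bounded and slots can be discarded whenever space runs short. */
    typedef struct {
        RelFileId   relid;        /* hash key */
        int         npages;       /* slots currently in use */
        FsmPageSlot pages[64];    /* arbitrary per-table budget */
    } FsmRelation;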

Critical point: the FSM is only a hint and does not have to be perfectly
accurate.  It can omit space that's actually available without harm, and
if it claims there's more space available on a page than there actually
is, we haven't lost much except a wasted ReadBuffer cycle.  This allows
us to take shortcuts in maintaining it.  In particular, we can constrain
the FSM to a prespecified size, which is critical for keeping it in shared
memory.  We just discard entries (pages or whole relations) as necessary
to keep it under budget.  Obviously, we'd not bother to make entries in
the first place for pages with only a little free space.  Relation entries
might be discarded on a least-recently-used basis.

Accesses to the FSM could create contention problems if we're not careful.
I think this can be dealt with by having each backend remember (in its
relcache entry for a table) the page number of the last page it chose from
the FSM to insert into.  That backend will keep inserting new tuples into
that same page, without touching the FSM, as long as there's room there.
Only then does it go back to the FSM, update or remove that page entry,
and choose another page to start inserting on.  This reduces the access
load on the FSM from once per tuple to once per page.  (Moreover, we can
arrange that successive backends consulting the FSM pick different pages
if possible.  Then, concurrent inserts will tend to go to different pages,
reducing contention for shared buffers; yet any single backend does
sequential inserts in one page, so that a bulk load doesn't cause
disk traffic scattered all over the table.)
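
A sketch of that once-per-page pattern (the relcache field and both
helper routines are hypothetical):

    typedef unsigned BlockNumber;
    #define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

    typedef struct {
        BlockNumber cur_insert_page;   /* cached in the relcache entry */
    } RelCacheEntry;

    /* Hypothetical helpers: test for room on a page, and consult the
     * shared FSM (which fixes up or removes the old page's entry and
     * steers concurrent inserters toward different pages). */
    extern int         page_has_room(BlockNumber blk, int tuplen);
    extern BlockNumber fsm_get_page(RelCacheEntry *rel, int tuplen);

    BlockNumber choose_insert_page(RelCacheEntry *rel, int tuplen)
    {
        /* Fast path, once per tuple: keep filling the same page,
         * touching no shared state at all. */
        if (rel->cur_insert_page != InvalidBlockNumber &&
            page_has_room(rel->cur_insert_page, tuplen))
            return rel->cur_insert_page;

        /* Slow path, reached only when a page fills: go back to the
         * FSM for a new target (InvalidBlockNumber would mean "no free
         * space known; extend the relation"). */
        rel->cur_insert_page = fsm_get_page(rel, tuplen);
        return rel->cur_insert_page;
    }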

The FSM can also cache the overall relation size, saving an lseek kernel
call whenever we do have to extend the relation for lack of internal free
space.  This will help pay for the locking cost of accessing the FSM.


Locking issues
--------------

We will need two extensions to the lock manager:

1. A new lock type that allows concurrent reads and writes
(AccessShareLock, RowShareLock, RowExclusiveLock) but not anything else.
Lazy VACUUM will grab this type of table lock to ensure the table schema
doesn't change under it.  Call it a VacuumLock until we think of a better
name.

2. A "conditional lock" operation that acquires a lock if available, but
doesn't block if not.

The conditional lock will be used by lazy VACUUM to try to upgrade its
VacuumLock to an AccessExclusiveLock at step D (truncate table).  If it's
able to get exclusive lock, it's safe to truncate any unused end pages.
Without exclusive lock, it's not, since there might be concurrent
transactions scanning or inserting into the empty pages.  We do not want
lazy VACUUM to block waiting to do this, since if it does that it will
create a lockout situation (reader/writer transactions will stack up
behind it in the lock queue while everyone waits for the existing
reader/writer transactions to finish).  Better to not do the truncation.
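
In outline, step D becomes something like the sketch below (the lock
manager and storage calls are hypothetical placeholders):

    typedef struct Relation Relation;

    extern int  cond_lock_access_exclusive(Relation *rel);  /* no-wait */
    extern void unlock_access_exclusive(Relation *rel);
    extern void unlink_trailing_empty_pages(Relation *rel);

    void try_truncate_end_pages(Relation *rel)
    {
        /*
         * Try to upgrade to AccessExclusiveLock without queueing.  If
         * we queued and waited instead, every later reader and writer
         * would stack up behind us -- exactly the lockout lazy VACUUM
         * is meant to avoid.
         */
        if (!cond_lock_access_exclusive(rel))
            return;                /* table is busy: skip truncation */

        unlink_trailing_empty_pages(rel);   /* give pages back to OS */
        unlock_access_exclusive(rel);
    }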

Another place where lazy VACUUM may be unable to do its job completely
is in compaction of space on individual disk pages.  It can physically
move tuples to perform compaction only if there are not currently any
other backends with pointers into that page (which can be tested by
looking to see if the buffer reference count is one).  Again, we punt
and leave the space to be compacted next time if we can't do it right
away.
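
For illustration, here is a minimal C sketch of what that in-page
compaction amounts to.  The page layout and all names are invented for
the example and are far simpler than the real page format:

    #include <string.h>

    #define PAGE_SIZE 8192
    #define MAX_LINES 64

    typedef struct {
        int offset;           /* tuple data offset; 0 means dead/unused */
        int length;           /* tuple data length in bytes */
    } LinePointer;

    typedef struct {
        LinePointer lp[MAX_LINES];
        int         nlines;
        char        data[PAGE_SIZE];
    } Page;

    /*
     * Slide the surviving tuples to the high end of the page so that
     * all free space coalesces into one hole.  No tuple ever leaves
     * its page, so only this one page needs an atomic (WAL-logged)
     * update.  Returns the boundary below which the page is now free.
     */
    static int compact_page(Page *page)
    {
        char scratch[PAGE_SIZE];
        int  high = PAGE_SIZE;

        for (int i = 0; i < page->nlines; i++) {
            if (page->lp[i].offset == 0)
                continue;                     /* dead tuple: drop it */
            high -= page->lp[i].length;
            memcpy(scratch + high,
                   page->data + page->lp[i].offset, page->lp[i].length);
            page->lp[i].offset = high;        /* repoint line pointer */
        }
        memcpy(page->data + high, scratch + high, PAGE_SIZE - high);
        return high;
    }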

The fact that inserted/updated tuples might wind up anywhere in the table,
not only at the end, creates no headaches except for heap_update.  That
routine needs buffer locks on both the page containing the old tuple and
the page that will contain the new.  To avoid possible deadlocks between
different backends locking the same two pages in opposite orders, we need
to constrain the lock ordering used by heap_update.  This is doable but
will require slightly more code than is there now.
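
One simple ordering rule suffices: always lock the lower-numbered block
first.  A sketch, with hypothetical names:

    typedef unsigned BlockNumber;
    typedef struct Buffer Buffer;

    extern Buffer *lock_page(BlockNumber blk);       /* hypothetical */

    /* Acquire both page locks in block-number order, so any two
     * backends updating between the same pair of pages always lock
     * them in the same sequence and therefore cannot deadlock. */
    void lock_pages_for_update(BlockNumber oldblk, BlockNumber newblk,
                               Buffer **oldbuf, Buffer **newbuf)
    {
        if (oldblk == newblk) {
            *oldbuf = *newbuf = lock_page(oldblk);   /* same page */
        } else if (oldblk < newblk) {
            *oldbuf = lock_page(oldblk);
            *newbuf = lock_page(newblk);
        } else {
            *newbuf = lock_page(newblk);
            *oldbuf = lock_page(oldblk);
        }
    }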


Index access method improvements
--------------------------------

Presently, VACUUM deletes index tuples by doing a standard index scan
and checking each returned index tuple to see if it points at any of
the tuples to be deleted.  If so, the index AM is called back to delete
the tested index tuple.  This is horribly inefficient: it means one trip
into the index AM (with associated buffer lock/unlock and search overhead)
for each tuple in the index, plus another such trip for each tuple actually
deleted.

This is mainly a problem of a poorly chosen API.  The index AMs should
offer a "bulk delete" call, which is passed a sorted array of main-table
TIDs.  The loop over the index tuples should happen internally to the
index AM.  At least in the case of btree, this could be done by a
sequential scan over the index pages, which avoids the random I/O of an
index-order scan and so should offer additional speedup.
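
The entry point might look something like this sketch (name and
signature are placeholders, not a settled interface):

    typedef unsigned BlockNumber;
    typedef struct { BlockNumber blk; unsigned short off; } ItemPointer;
    typedef struct Relation Relation;

    /*
     * Remove every index entry whose heap pointer appears in "tids"
     * (sorted ascending); return the number of entries removed.  The
     * loop over index tuples lives inside the AM, so btree can satisfy
     * this with one sequential pass over its pages instead of an
     * index-order traversal plus one AM call per tuple.
     */
    extern long index_bulk_delete(Relation *index,
                                  const ItemPointer *tids, int ntids);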

Further out (possibly not for 7.2), we should also look at making the
index AMs responsible for shrinking indexes during deletion, or perhaps
via a separate "vacuum index" API.  This can be done without exclusive
locks on the index --- the original Lehman & Yao concurrent-btrees paper
didn't describe how, but more recent papers show how to do it.  As with
the main tables, I think it's sufficient to recycle freed space within
the index, and not necessarily try to give it back to the OS.

We will also want to look at upgrading the non-btree index types to allow
concurrent operations.  This may be a research problem; I don't expect to
touch that issue for 7.2.  (Hence, lazy VACUUM on tables with non-btree
indexes will still create lockouts until this is addressed.  But note that
the lockout only lasts through step B of the VACUUM, not the whole thing.)


There you have it.  If people like this, I'm prepared to commit to
making it happen for 7.2.  Comments, objections, better ideas?
        regards, tom lane


Re: Plans for solving the VACUUM problem

From
Doug McNaught
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

[...]

> There you have it.  If people like this, I'm prepared to commit to
> making it happen for 7.2.  Comments, objections, better ideas?

I'm just commenting from the peanut gallery, but it looks really
well-thought-out to me.  I like the general "fast, good enough, not
necessarily perfect" approach you've taken.

I'd be happy to help test and debug once things get going.

-Doug
-- 
The rain man gave me two cures; he said jump right in,
The first was Texas medicine--the second was just railroad gin,
And like a fool I mixed them, and it strangled up my mind,
Now people just get uglier, and I got no sense of time...          --Dylan


RE: Plans for solving the VACUUM problem

From
Mike Mascari
Date:
Very neat. You mention that the truncation of both heap and index 
relations is not necessarily mandatory. Under what conditions would 
either of them be truncated?

Mike Mascari
mascarm@mascari.com

-----Original Message-----
From:    Tom Lane [SMTP:tgl@sss.pgh.pa.us]

....

3. Lazy VACUUM processes a table in five stages:
   A. Scan relation looking for dead tuples; accumulate a list of their
      TIDs, as well as info about existing free space.  (This pass is
      completely read-only and so incurs no WAL traffic.)
   B. Remove index entries for the dead tuples.  (See below for details.)
   C. Physically delete dead tuples and compact free space on their pages.
   D. Truncate any completely-empty pages at relation's end.  (Optional,
      see below.)
   E. Create/update FSM entry for the table.

...

Further out (possibly not for 7.2), we should also look at making the
index AMs responsible for shrinking indexes during deletion, or perhaps
via a separate "vacuum index" API.  This can be done without exclusive
locks on the index --- the original Lehman & Yao concurrent-btrees paper
didn't describe how, but more recent papers show how to do it.  As with
the main tables, I think it's sufficient to recycle freed space within
the index, and not necessarily try to give it back to the OS.

...        regards, tom lane



Re: Plans for solving the VACUUM problem

From
mlw
Date:
Tom Lane wrote:
> 
I love it all. 
I agree that vacuum should be an optional function that really packs tables.

I also like the idea of a vacuum that runs in the background and does not
affect operation too badly.

My only suggestion would be to store some information in the statistics about
whether or not, and how badly, a table needs to be vacuumed. In a scheduled
background environment, the tables that need it most should get it most often.
Oftentimes many tables never need to be vacuumed.

Also, it would be good to be able to update the statistics without doing a
vacuum, i.e. rather than having to vacuum to analyze, being able to analyze
without a vacuum.


Re: Plans for solving the VACUUM problem

From
Bruce Momjian
Date:
> Free space map details
> ----------------------
> 
> I envision the FSM as a shared hash table keyed by table ID, with each
> entry containing a list of page numbers and free space in each such page.
> 
> The FSM is empty at system startup and is filled by lazy VACUUM as it
> processes each table.  Backends then decrement/remove page entries as they
> use free space.
> 
> Critical point: the FSM is only a hint and does not have to be perfectly
> accurate.  It can omit space that's actually available without harm, and
> if it claims there's more space available on a page than there actually
> is, we haven't lost much except a wasted ReadBuffer cycle.  This allows
> us to take shortcuts in maintaining it.  In particular, we can constrain
> the FSM to a prespecified size, which is critical for keeping it in shared
> memory.  We just discard entries (pages or whole relations) as necessary
> to keep it under budget.  Obviously, we'd not bother to make entries in
> the first place for pages with only a little free space.  Relation entries
> might be discarded on a least-recently-used basis.

The only question I have is about the Free Space Map.  It would seem
better to me if we could get this map closer to the table itself, rather
than having every table of every database mixed into the same shared
memory area.  I can just see random table access clearing out most of
the map cache and perhaps making it less useful.

It would be nice if we could store the map on the first page of the disk
table, or store it in a flat file per table.  I know both of these ideas
will not work, but I am just throwing it out to see if someone has a
better idea.  

I wonder if cache failures should be what drives the vacuum daemon to
vacuum a table?  Sort of like, "Hey, someone is asking for free pages
for that table.  Let's go find some!"  That may work really well. 
Another advantage of centralization is that we can record update/delete
counters per table, helping tell vacuum where to vacuum next.  Vacuum
roaming around looking for old tuples seems wasteful.

Also, I suppose if we have the map act as a shared table cache (fseek
info), that may outweigh the disadvantage of having it all centralized.

I know I am throwing out the advantages and disadvantages of
centralization, but I thought I would give out the ideas.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Re: Plans for solving the VACUUM problem

From
"August Zajonc"
Date:
Heck ya...

> I wonder if cache failures should be what drives the vacuum daemon to
> vacuum a table?  Sort of like, "Hey, someone is asking for free pages
> for that table.  Let's go find some!"  That may work really well.
> Another advantage of centralization is that we can record update/delete
> counters per table, helping tell vacuum where to vacuum next.  Vacuum
> roaming around looking for old tuples seems wasteful.

Counters seem like a nice addition. For example, access patterns to session
tables are almost pure UPDATE/DELETEs and a ton of them. On the other hand,
log type tables see no UPDATE/DELETE but tend to be huge in comparison. I
suspect many applications outside ours will show large disparities in the
"Vacuumability" score for different tables.

Quick question:
Using lazy vacuum, if I have two identical (at the file level) copies of a
database, run the same queries against them for a few days, then shut them
down again, are the copies still identical? Is this different from the
current behavior (i.e., queries, full vacuum)?


AZ




Re: Re: Plans for solving the VACUUM problem

From
"Matthew T. O'Connor"
Date:
> Also, it would be good to be able to update the statistics without doing a
> vacuum, i.e. rather than having to vacuum to analyze, being able to
> analyze without a vacuum.
>
I was going to ask the same thing.  In a lot of situations (insert
dominated) vacuum analyze is more important for the update of statistics
than for recovery of space.  Could we roll that into this?  Or should we
also have an Analyze daemon?



Re: Plans for solving the VACUUM problem

From
"Matthew T. O'Connor"
Date:
> Free space map details
> ----------------------
>
> Accesses to the FSM could create contention problems if we're not careful.

Another quick thought for handling FSM contention problems.  A backend could
give up waiting for access to the FSM after a short period of time, and just
append its data to the end of the file the same way it's done now.  Dunno
if that is feasible but it seemed like an idea to me.

Other than that, I would just like to say this will be a great improvement
for pgsql.  Tom, you and several others on this list continue to impress the
hell out of me.



Re: Plans for solving the VACUUM problem

From
Tom Lane
Date:
mlw <markw@mohawksoft.com> writes:
> My only suggestion would be to store some information in the statistics about
> whether or not, and how bad, a table needs to be vacuumed.

I was toying with the notion of using the FSM to derive that info,
somewhat indirectly to be sure (since what the FSM could tell you would
be about tuples inserted not tuples deleted).  Heavily used FSM entries
would be the vacuum daemon's cues for tables to hit more often.

ANALYZE stats don't seem like a productive way to attack this, since
there's no guarantee they'd be updated often enough.  If the overall
data distribution of a table isn't changing much, there's no need to
analyze it often.  

> Also, it would be good to be able to update the statistics without doing a
> vacuum, i.e. rather than having to vacuum to analyze, being able to analyze
> without a vacuum.

Irrelevant, not to mention already done ...
        regards, tom lane


Re: Plans for solving the VACUUM problem

From
Tim Allen
Date:
On Thu, 17 May 2001, Tom Lane wrote:

> I have been thinking about the problem of VACUUM and how we might fix it
> for 7.2.  Vadim has suggested that we should attack this by implementing
> an overwriting storage manager and transaction UNDO, but I'm not totally
> comfortable with that approach: it seems to me that it's an awfully large
> change in the way Postgres works.  Instead, here is a sketch of an attack
> that I think fits better into the existing system structure.

<snip>


My AUD0.02, FWIW, is that this sounds great. You said you were only planning
to concentrate on performance enhancements, not new features, Tom, but IMHO
this is a new feature and a good one :).

As several others have mentioned, automatic analyze would also be nice. I
gather the backend already has the ability to treat analyze as a separate
process, so presumably this is a completely separate issue from automatic
vacuum. Some sort of background daemon or whatever would be good. And
again, one could take the approach that it doesn't have to get it 100%
right, at least in the short term; as long as it is continually
moving in the direction of accurate statistics, then that's
much better than the current situation. Presumably one could also retain
the option of doing an explicit analyze occasionally, if you have
processor cycles to burn and are really keen to get the stats correct in
a hurry.

Tim

-- 
-----------------------------------------------
Tim Allen          tim@proximity.com.au
Proximity Pty Ltd  http://www.proximity.com.au/ http://www4.tpg.com.au/users/rita_tim/



Re: Plans for solving the VACUUM problem

From
Tom Lane
Date:
Mike Mascari <mascarm@mascari.com> writes:
> Very neat. You mention that the truncation of both heap and index 
> relations is not necessarily mandatory. Under what conditions would 
> either of them be truncated?

In the proposal as given, a heap file would be truncated if (a) it
has at least one totally empty block at the end, and (b) no other
transaction is touching the table at the instant that VACUUM is
ready to truncate it.

This would probably be fairly infrequently true, especially for
heavily used tables, but if you believe in a "steady state" analysis
then that's just fine.  No point in handing blocks back to the OS
only to have to allocate them again soon.

We might want to try to tilt the FSM-driven reuse of freed space
to favor space near the start of the file and avoid end blocks.
Without that, you might never get totally free blocks at the end.
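
One cheap way to get that tilt, assuming the page-list form of the FSM
(a sketch with invented names): among the candidate pages with enough
room, always hand out the lowest block number.

    typedef unsigned BlockNumber;
    #define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

    typedef struct { BlockNumber page; int free_bytes; } FsmPageSlot;

    /* Prefer the lowest-numbered page that can hold the tuple, so new
     * data drifts toward the start of the file and trailing pages can
     * eventually empty out and become truncatable. */
    BlockNumber fsm_pick_page(const FsmPageSlot *slots, int n, int need)
    {
        BlockNumber best = InvalidBlockNumber;

        for (int i = 0; i < n; i++)
            if (slots[i].free_bytes >= need && slots[i].page < best)
                best = slots[i].page;
        return best;     /* InvalidBlockNumber: extend the relation */
    }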

The same comments hold for index blocks, with the additional problem
that the index structure would make it almost impossible to drive usage
away from the physical end-of-file.  For btrees I think it'd be
sufficient if we could recycle empty blocks for use elsewhere in the
btree structure.  Actually shrinking the index probably won't happen
short of a REINDEX.
        regards, tom lane


Re: Plans for solving the VACUUM problem

From
Hiroshi Inoue
Date:
Tom Lane wrote:
> 
> I have been thinking about the problem of VACUUM and how we might fix it
> for 7.2.  Vadim has suggested that we should attack this by implementing
> an overwriting storage manager and transaction UNDO, but I'm not totally
> comfortable with that approach: 

IIRC, Vadim doesn't intend to implement an overwriting
smgr, at least in the near future, though he seems to
have a plan to implement functionality to allow space
re-use without vacuum, as marked in TODO. IMHO UNDO
functionality under the no-overwriting smgr (ISTM much
easier than under an overwriting one) has the highest
priority. Savepoints were planned for 7.0 but we don't
have them yet.

[snip]

> 
> Locking issues
> --------------
> 
> We will need two extensions to the lock manager:
> 
> 1. A new lock type that allows concurrent reads and writes
> (AccessShareLock, RowShareLock, RowExclusiveLock) but not
> anything else.

What's different from RowExclusiveLock ?
Does it conflict with itself ?

> 
> The conditional lock will be used by lazy VACUUM to try to upgrade its
> VacuumLock to an AccessExclusiveLock at step D (truncate table).  If it's
> able to get exclusive lock, it's safe to truncate any unused end pages.
> Without exclusive lock, it's not, since there might be concurrent
> transactions scanning or inserting into the empty pages.  We do not want
> lazy VACUUM to block waiting to do this, since if it does that it will
> create a lockout situation (reader/writer transactions will stack up
> behind it in the lock queue while everyone waits for the existing
> reader/writer transactions to finish).  Better to not do the truncation.
> 

And would the truncation occur that often in reality under
the scheme (without tuple movement)?

[snip]

> 
> Index access method improvements
> --------------------------------
> 
> Presently, VACUUM deletes index tuples by doing a standard index scan
> and checking each returned index tuple to see if it points at any of
> the tuples to be deleted.  If so, the index AM is called back to delete
> the tested index tuple.  This is horribly inefficient: it means one trip
> into the index AM (with associated buffer lock/unlock and search overhead)
> for each tuple in the index, plus another such trip for each tuple actually
> deleted.
> 
> This is mainly a problem of a poorly chosen API.  The index AMs should
> offer a "bulk delete" call, which is passed a sorted array of main-table
> TIDs.  The loop over the index tuples should happen internally to the
> index AM.  At least in the case of btree, this could be done by a
> sequential scan over the index pages, which avoids the random I/O of an
> index-order scan and so should offer additional speedup.
> 

???? Isn't the current implementation "bulk delete"?
Fast access to individual index tuples also seems to be
needed in the case of only a few dead tuples.

> Further out (possibly not for 7.2), we should also look at making the
> index AMs responsible for shrinking indexes during deletion, or perhaps
> via a separate "vacuum index" API.  This can be done without exclusive
> locks on the index --- the original Lehman & Yao concurrent-btrees paper
> didn't describe how, but more recent papers show how to do it.  As with
> the main tables, I think it's sufficient to recycle freed space within
> the index, and not necessarily try to give it back to the OS.
> 

Great. There would be few disadvantages in our btree
implementation.

regards,
Hiroshi Inoue


Re: Plans for solving the VACUUM problem

From
Tom Lane
Date:
"Matthew T. O'Connor" <matthew@zeut.net> writes:
> Another quick thought for handling FSM contention problems.  A backend could
> give up waiting for access to the FSM after a short period of time, and just
> append its data to the end of the file the same way it's done now.  Dunno
> if that is feasible but it seemed like an idea to me.

Mmm ... maybe, but I doubt it'd help much.  Appending a page to the file
requires grabbing some kind of lock anyway (since you can't have two
backends doing it at the same instant).  With any luck, that locking can
be merged with the locking involved in accessing the FSM.
        regards, tom lane


Re: Plans for solving the VACUUM problem

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> The only question I have is about the Free Space Map.  It would seem
> better to me if we could get this map closer to the table itself, rather
> than having every table of every database mixed into the same shared
> memory area.  I can just see random table access clearing out most of
> the map cache and perhaps making it less useful.

What random access?  Read transactions will never touch the FSM at all.
As for writes, seems to me the places you are writing are exactly the
places you need info for.

You make a good point, which is that we don't want a schedule-driven
VACUUM to load FSM entries for unused tables into the map at the cost
of throwing out entries that *are* being used.  But it seems to me that
that's easily dealt with if we recognize the risk.

> It would be nice if we could store the map on the first page of the disk
> table, or store it in a flat file per table.  I know both of these ideas
> will not work,

You said it.  What's wrong with shared memory?  You can't get any closer
than shared memory: keeping maps in the files would mean you'd need to
chew up shared-buffer space to get at them.  (And what was that about
random accesses causing your maps to get dropped?  That would happen
for sure if they live in shared buffers.)

Another problem with keeping stuff in the first page: what happens when
the table gets big enough that 8k of map data isn't really enough?
With a shared-memory area, we can fairly easily allocate a variable
amount of space based on total size of a relation vs. total size of
relations under management.

It is true that a shared-memory map would be useless at system startup,
until VACUUM has run and filled in some info.  But I don't see that as
a big drawback.  People who aren't developers like us don't restart
their postmasters every five minutes.

> Another advantage of centralization is that we can record update/delete
> counters per table, helping tell vacuum where to vacuum next.  Vacuum
> roaming around looking for old tuples seems wasteful.

Indeed.  But I thought you were arguing against centralization?
        regards, tom lane


Re: Plans for solving the VACUUM problem

From
Philip Warner
Date:
At 19:05 17/05/01 -0400, Tom Lane wrote:
>
>But having said that, there's no reason to remove the existing VACUUM
>code: we can keep it around for situations where you need to crunch a
>table as much as possible and you can afford to lock the table while
>you do it.  

It would be great if this was the *only* reason to use the old-style
VACUUM, i.e. we should try to avoid a solution that has a VACUUM LAZY in
background and a recommendation to a 'VACUUM PROPERLY' once in a while.


>The new code would be a new command, maybe "VACUUM LAZY"
>(or some other name entirely).

Maybe a name that reflects its strength/purpose: 'VACUUM
ONLINE/BACKGROUND/NOLOCKS/CONCURRENT' etc.


>Enough handwaving, what about specifics?
>
>1. Forget moving tuples from one page to another.  Doing that in a
>transaction-safe way is hugely expensive and complicated.  Lazy VACUUM
>will only delete dead tuples and coalesce the free space thus made
>available within each page of a relation.

Could this be done opportunistically, meaning it builds up a list of
candidates to move (perhaps based on emptiness of page), then moves a
subset of these in each pass? It's only really useful in the case of a
table that has a high update load then becomes static, which is not as
unusual as it sounds: people do archive tables by renaming them, then
create a new lean 'current' table. With the new vacuum, the static table
may end up with many half-empty pages that are never reused.


>2. This does no good unless there's a provision to re-use that free space.
>To do that, I propose a free space map (FSM) kept in shared memory, which
>will tell backends which pages of a relation have free space.  Only if the
>FSM shows no free space available will the relation be extended to insert
>a new or updated tuple.

I assume that now is not a good time to bring up memory-mapped files? ;-}


>3. Lazy VACUUM processes a table in five stages:
>   A. Scan relation looking for dead tuples; accumulate a list of their
>      TIDs, as well as info about existing free space.  (This pass is
>      completely read-only and so incurs no WAL traffic.)

Were you planning on just a free byte count, or something smaller? Dec/RDB
uses a nasty system of DBA-defined thresholds for each storage area: 4 bits,
where 0=empty, and 1, 2 & 3 indicate above/below thresholds (3 is also
considered 'full'). The thresholds are usually set based on average record
sizes. In this day & age, I suspect a 1 byte percentage, or 2 byte count is
OK unless space is really at a premium.


>5. Also note that there's nothing saying that lazy VACUUM must do the
>entire table in one go; once it's accumulated a big enough batch of dead
>tuples, it can proceed through steps B,C,D,E even though it's not scanned
>the whole table.  This avoids a rather nasty problem that VACUUM has
>always had with running out of memory on huge tables.

This sounds great, especially if the same approach could be adopted when/if
moving records.


>Critical point: the FSM is only a hint and does not have to be perfectly
>accurate.  It can omit space that's actually available without harm, and
>if it claims there's more space available on a page than there actually
>is, we haven't lost much except a wasted ReadBuffer cycle.

So long as you store the # of bytes (or %), that should be fine. One of the
horrors of the Dec/RDB system is that with badly set thresholds you can
cycle through many pages looking for one that *really* has enough free space.

Also, would the detecting process fix the bad entry?


>  This allows
>us to take shortcuts in maintaining it.  In particular, we can constrain
>the FSM to a prespecified size, which is critical for keeping it in shared
>memory.  We just discard entries (pages or whole relations) as necessary
>to keep it under budget.

Presumably keeping the 'most empty' pages?


>Obviously, we'd not bother to make entries in
>the first place for pages with only a little free space.  Relation entries
>might be discarded on a least-recently-used basis.

You also might want to record some 'average/min/max' record size for the
table to assess when a page's free space is insufficient for the
average/minimum record size.

While on the subject of record keeping, it would be great if it was coded
to collect statistics about its own operation for Jan's stats package.


>Accesses to the FSM could create contention problems if we're not careful.
>I think this can be dealt with by having each backend remember (in its
>relcache entry for a table) the page number of the last page it chose from
>the FSM to insert into.  That backend will keep inserting new tuples into
>that same page, without touching the FSM, as long as there's room there.
>Only then does it go back to the FSM, update or remove that page entry,
>and choose another page to start inserting on.  This reduces the access
>load on the FSM from once per tuple to once per page.  

This seems to have the potential to create as many false FSM page entries
as there are backends. Is it really that expensive to lock the FSM table
entry, subtract a number, then unlock it, especially when compared to the
cost of adding/updating a whole record?


>(Moreover, we can
>arrange that successive backends consulting the FSM pick different pages
>if possible.  Then, concurrent inserts will tend to go to different pages,
>reducing contention for shared buffers; yet any single backend does
>sequential inserts in one page, so that a bulk load doesn't cause
>disk traffic scattered all over the table.)

This will also increase the natural clustering of the database for SERIAL
fields even though the overall ordering will be all over the place (at
least for insert-intensive apps).


>
>Locking issues
>--------------
>
>We will need two extensions to the lock manager:
>
>1. A new lock type that allows concurrent reads and writes
>(AccessShareLock, RowShareLock, RowExclusiveLock) but not anything else.
>Lazy VACUUM will grab this type of table lock to ensure the table schema
>doesn't change under it.  Call it a VacuumLock until we think of a better
>name.

Is it possible/worth adding a 'blocking notification' to the lock manager?
Then VACUUM could choose to terminate/restart when someone wants to do a
schema change. This is really only relevant if the VACUUM will be prolonged.



----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.B.N. 75 008 659 498)          |          /(@)   ______---_
Tel: (+61) 0500 83 82 81         |                 _________  \
Fax: (+61) 0500 83 82 82         |                 ___________ |
Http://www.rhyme.com.au          |                /           \|
                                 |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/


Re: Plans for solving the VACUUM problem

From
Tom Lane
Date:
Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> Tom Lane wrote:
>> 1. A new lock type that allows concurrent reads and writes
>> (AccessShareLock, RowShareLock, RowExclusiveLock) but not
>> anything else.

> What's different from RowExclusiveLock ?

I wanted something that *does* conflict with itself, thereby ensuring
that only one instance of VACUUM can be running on a table at a time.
This is not absolutely necessary, perhaps, but without it matters would
be more complex for little gain that I can see.  (For example, the tuple
TIDs that lazy VACUUM gathers in step A might already be deleted and
compacted out of existence by another VACUUM, and then reused as new
tuples, before the first VACUUM gets back to the page to delete them.
There would need to be a defense against that if we allow concurrent
VACUUMs.)

> And would the truncation occur that often in reality under
> the scheme(without tuple movement) ?

Probably not, per my comments to someone else.  I'm not very concerned
about that, as long as we are able to recycle freed space within the
relation.

We could in fact move tuples if we wanted to --- it's not fundamentally
different from an UPDATE --- but then VACUUM becomes a transaction and
we have the WAL-log-traffic problem back again.  I'd like to try it
without that and see if it gets the job done.

> ???? Isn't current implementation "bulk delete" ?

No, the index AM is called separately for each index tuple to be
deleted; more to the point, the search for deletable index tuples
should be moved inside the index AM for performance reasons.

> Fast access to individual index tuples seems to be
> also needed in case a few dead tuples.

Yes, that is the approach Vadim used in the LAZY VACUUM code he did
last summer for Alfred Perlstein.  I had a couple of reasons for not
duplicating that method:

1. Searching for individual index tuples requires hanging onto (or
refetching) the index key data for each target tuple.  A bulk delete
based only on tuple TIDs is simpler and requires little memory.

2. The main reason for that form of lazy VACUUM is to minimize the time
spent holding the exclusive lock on a table.  With this approach we
don't need to worry so much about that.  A background task trawling
through an index won't really bother anyone, since it won't lock out
updates.

3. If we're concerned about the speed of lazy VACUUM when there's few
tuples to be recycled, there's a really easy answer: don't recycle 'em.
Leave 'em be until there are enough to make it worth going after 'em.
Remember this is a "fast and good enough" approach, not a "get every
possible byte every time" approach.

Using a key-driven delete when we have only a few deletable tuples might
be a useful improvement later on, but I don't think it's necessary to
bother with it in the first go-round.  This is a big project already...
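
To make point 1 concrete, a small sketch (invented types) of the
membership test the bulk pass would do: sort the dead TIDs once, and
each index tuple then costs one binary search over a compact array,
with no index key data retained at all.

    #include <stdlib.h>

    typedef struct { unsigned blk; unsigned short off; } Tid;

    static int tid_cmp(const void *a, const void *b)
    {
        const Tid *x = a, *y = b;
        if (x->blk != y->blk)
            return (x->blk < y->blk) ? -1 : 1;
        if (x->off != y->off)
            return (x->off < y->off) ? -1 : 1;
        return 0;
    }

    /* During the sequential scan of the index, test each index tuple's
     * heap pointer against the sorted dead-TID array: O(log n) apiece,
     * half a dozen bytes of memory per dead tuple, no keys needed. */
    int tid_is_dead(const Tid *heap_tid, const Tid *dead, int ndead)
    {
        return bsearch(heap_tid, dead, (size_t) ndead,
                       sizeof(Tid), tid_cmp) != NULL;
    }
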
        regards, tom lane


Re: Re: Plans for solving the VACUUM problem

From
Hannu Krosing
Date:
Tom Lane wrote:
> 
> mlw <markw@mohawksoft.com> writes:
> > Also, it would be good to be able to update the statistics without doing a
> > vacuum, i.e. rather than having to vacuum to analyze, being able to analyze
> > without a vacuum.
> 
> Irrelevant, not to mention already done ...

Do you mean that we can already do just analyze?

What is the syntax?

----------------
Hannu


Re: Plans for solving the VACUUM problem

From
Tom Lane
Date:
Philip Warner <pjw@rhyme.com.au> writes:
> At 19:05 17/05/01 -0400, Tom Lane wrote:
>> 1. Forget moving tuples from one page to another.

> Could this be done opportunistically, meaning it builds up a list of
> candidates to move (perhaps based on emptiness of page), then moves a
> subset of these in each pass?

Well, if we move tuples at all then we have a considerably different
animal: to move tuples across pages you must be a transaction so that
you can have an atomic commit for both pages, and that brings back the
issue of how long the transaction runs for and how large its WAL trail
will grow before it can be dropped.  Yeah, you could move a limited
number of tuples, commit, and start again ... but it's not so
lightweight anymore.

Perhaps we will eventually end up with three strengths of VACUUM:
the existing heavy-duty form, the "lazy" form that isn't transactional,
and an intermediate form that is willing to move tuples in simpler
cases (none of that tuple-chain-moving stuff please ;-)).  But I'm
not buying into producing the intermediate form in this go-round.
Let's build the lazy form first and get some experience with it
before we decide if we need yet another kind of VACUUM.

>> To do that, I propose a free space map (FSM) kept in shared memory, which
>> will tell backends which pages of a relation have free space.  Only if the
>> FSM shows no free space available will the relation be extended to insert
>> a new or updated tuple.

> I assume that now is not a good time to bring up memory-mapped files? ;-}

Don't see the relevance exactly ...

> Were you planning on just a free byte count, or something smaller? Dec/RDB
> uses a nasty system of DBA-defined thresholds for each storage area: 4 bits,
> where 0=empty, and 1, 2 & 3 indicate above/below thresholds (3 is also
> considered 'full'). The thresholds are usually set based on average record
> sizes. In this day & age, I suspect a 1 byte percentage, or 2 byte count is
> OK unless space is really at a premium.

I had toyed with two different representations of the FSM:

1. Bitmap: one bit per page in the relation, set if there's an
"interesting" amount of free space in the page (exact threshold ???).
DEC's approach seems to be a generalization of this.

2. Page list: page number and number of free bytes.  This is six bytes
per page represented; you could maybe compress it to 5 but I'm not sure
there's much point.

I went with #2 mainly because it adapts easily to being forced into a
limited amount of space (just forget the pages with least free space)
which is critical for a table to be kept in shared memory.  A bitmap
would be less forgiving.  #2 also gives you a better chance of going
to a page that actually has enough free space for your tuple, though
you'd still need to be prepared to find out that it no longer has
enough once you get to it.  (Whereupon, you go back to the FSM, fix
the outdated entry, and keep searching.)
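
The "forget the least interesting pages" step can be that simple, too.
A sketch, with invented names: sort one table's slots by free space and
keep only as many as the budget allows.

    #include <stdlib.h>

    typedef struct { unsigned page; int free_bytes; } FsmPageSlot;

    static int by_free_desc(const void *a, const void *b)
    {
        const FsmPageSlot *x = a, *y = b;
        return y->free_bytes - x->free_bytes;  /* most free space first */
    }

    /* Squeeze one table's page list under its shared-memory budget by
     * dropping the least useful slots; returns the new slot count.
     * Since the FSM is only a hint, nothing is lost but opportunity. */
    int fsm_shrink_to_budget(FsmPageSlot *slots, int n, int budget)
    {
        if (n <= budget)
            return n;
        qsort(slots, (size_t) n, sizeof(FsmPageSlot), by_free_desc);
        return budget;
    }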

> While on the subject of record keeping, it would be great if it was coded
> to collect statistics about its own operation for Jan's stats package.

Seems like a good idea, but I've seen no details yet about that package...

> This seems to have the potential to create as many false FSM page entries
> as there are backends. Is it really that expensive to lock the FSM table
> entry, subtract a number, then unlock it?

Yes, if you are contending against N other backends to get that lock.
Remember the *whole* point of this design is to avoid locking as much
as possible.  Too many trips to the FSM could throw away the performance
advantage.

> Is it possible/worth adding a 'blocking notification' to the lock manager?
> Then VACUUM could choose to terminate/restart when someone wants to do a
> schema change. This is really only relevant if the VACUUM will be prolonged.

Seems messier than it's worth ... the VACUUM might not be the only thing
holding off your schema update anyway, and regular transactions aren't
likely to pay any attention.
        regards, tom lane


Re: Plans for solving the VACUUM problem

From
Hannu Krosing
Date:
Tom Lane wrote:
> 
>> 
> Enough handwaving, what about specifics?
> 
> 1. Forget moving tuples from one page to another.  Doing that in a
> transaction-safe way is hugely expensive and complicated.  Lazy VACUUM
> will only delete dead tuples and coalesce the free space thus made
> available within each page of a relation.

If we really need to move a tuple we can do it by a regular update that
SET-s nothing and just copies the tuple to another page, inside a
separate transaction.

-------------------
Hannu


Re: Plans for solving the VACUUM problem

From
Hiroshi Inoue
Date:
Tom Lane wrote:
> 
> Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> > Tom Lane wrote:
> 
> > And would the truncation occur that often in reality under
> > the scheme(without tuple movement) ?
> 
> Probably not, per my comments to someone else.  I'm not very concerned
> about that, as long as we are able to recycle freed space within the
> relation.
> 

Agreed.

> We could in fact move tuples if we wanted to --- it's not fundamentally
> different from an UPDATE --- but then VACUUM becomes a transaction and
> we have the WAL-log-traffic problem back again.

And it has always been the cause of bugs and inefficiency
in VACUUM, IMHO.

regards,
Hiroshi Inoue


Re: Re: Plans for solving the VACUUM problem

From
Tom Lane
Date:
Hannu Krosing <hannu@tm.ee> writes:
>> Irrelevant, not to mention already done ...

> Do you mean that we already can do just analyze ?

In development sources, yes.  See

http://www.ca.postgresql.org/devel-corner/docs/postgres/sql-analyze.html
        regards, tom lane


Re: Plans for solving the VACUUM problem

From
Oleg Bartunov
Date:
On Thu, 17 May 2001, Tom Lane wrote:

>
> We will also want to look at upgrading the non-btree index types to allow
> concurrent operations.  This may be a research problem; I don't expect to
> touch that issue for 7.2.  (Hence, lazy VACUUM on tables with non-btree
> indexes will still create lockouts until this is addressed.  But note that
> the lockout only lasts through step B of the VACUUM, not the whole thing.)

am I right you plan to work with GiST indexes as well ?
We read a paper "Concurrency and Recovery in Generalized Search Trees"
by Marcel Kornacker, C. Mohan, Joseph Hellerstein
(http://citeseer.nj.nec.com/kornacker97concurrency.html)
and probably we could go in this direction. Right now we're working
on adding of multi-key support to GiST.

btw, I have a question about function gistPageAddItem in gist.c:
it just decompresses/compresses the key and calls PageAddItem to
write the tuple.  We don't understand why we need this function -
why not use the PageAddItem function?  Adding multi-key support requires
a lot of work and we don't want to waste our efforts and time.
We have already done some tests (gistPageAddItem -> PageAddItem) and
everything is ok.  Bruce, you're enthusiastic in removing unused code :-)



>
>             regards, tom lane
>
Regards,    Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83



Re: Plans for solving the VACUUM problem

From
Tom Lane
Date:
Oleg Bartunov <oleg@sai.msu.su> writes:
> On Thu, 17 May 2001, Tom Lane wrote:
>> We will also want to look at upgrading the non-btree index types to allow
>> concurrent operations.

> am I right you plan to work with GiST indexes as well ?
> We read a paper "Concurrency and Recovery in Generalized Search Trees"
> by Marcel Kornacker, C. Mohan, Joseph Hellerstein
> (http://citeseer.nj.nec.com/kornacker97concurrency.html)
> and probably we could go in this direction. Right now we're working
> on adding of multi-key support to GiST.

Yes, GIST should be upgraded to do concurrency.  But I have no objection
if you want to work on multi-key support first.

My feeling is that a few releases from now we will have btree and GIST
as the preferred/well-supported index types.  Hash and rtree might go
away altogether --- AFAICS they don't do anything that's not done as
well or better by btree or GIST, so what's the point of maintaining
them?

> btw, I have a question about function gistPageAddItem in gist.c:
> it just decompresses/compresses the key and calls PageAddItem to
> write the tuple.  We don't understand why we need this function -

The comment says

** Take a compressed entry, and install it on a page.  Since we now know
** where the entry will live, we decompress it and recompress it using
** that knowledge (some compression routines may want to fish around
** on the page, for example, or do something special for leaf nodes.)

Are you prepared to say that you will no longer support the ability for
GIST compression routines to do those things?  That seems shortsighted.
        regards, tom lane


Re: Plans for solving the VACUUM problem

From
Bruce Momjian
Date:
> Oleg Bartunov <oleg@sai.msu.su> writes:
> > On Thu, 17 May 2001, Tom Lane wrote:
> >> We will also want to look at upgrading the non-btree index types to allow
> >> concurrent operations.
> 
> > am I right you plan to work with GiST indexes as well ?
> > We read a paper "Concurrency and Recovery in Generalized Search Trees"
> > by Marcel Kornacker, C. Mohan, Joseph Hellerstein
> > (http://citeseer.nj.nec.com/kornacker97concurrency.html)
> > and probably we could go in this direction. Right now we're working
> > on adding of multi-key support to GiST.
> 
> Yes, GIST should be upgraded to do concurrency.  But I have no objection
> if you want to work on multi-key support first.
> 
> My feeling is that a few releases from now we will have btree and GIST
> as the preferred/well-supported index types.  Hash and rtree might go
> away altogether --- AFAICS they don't do anything that's not done as
> well or better by btree or GIST, so what's the point of maintaining
> them?

We clearly have too many index types, and we are actively telling people
not to use hash.  And Oleg, don't you have a better version of GIST rtree
than our native rtree?

I certainly like streamlining our stuff.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Re: Plans for solving the VACUUM problem

From
Oleg Bartunov
Date:
On Fri, 18 May 2001, Tom Lane wrote:

> Oleg Bartunov <oleg@sai.msu.su> writes:
> > On Thu, 17 May 2001, Tom Lane wrote:
> >> We will also want to look at upgrading the non-btree index types to allow
> >> concurrent operations.
>
> > am I right you plan to work with GiST indexes as well ?
> > We read a paper "Concurrency and Recovery in Generalized Search Trees"
> > by Marcel Kornacker, C. Mohan, Joseph Hellerstein
> > (http://citeseer.nj.nec.com/kornacker97concurrency.html)
> > and probably we could go in this direction. Right now we're working
> > on adding of multi-key support to GiST.

Another paper to read:
"Efficient Concurrency Control in Multidimensional Access Methods"
by Kaushik Chakrabarti
http://www.ics.uci.edu/~kaushik/research/pubs.html

>
> Yes, GIST should be upgraded to do concurrency.  But I have no objection
> if you want to work on multi-key support first.
>
> My feeling is that a few releases from now we will have btree and GIST
> as the preferred/well-supported index types.  Hash and rtree might go
> away altogether --- AFAICS they don't do anything that's not done as
> well or better by btree or GIST, so what's the point of maintaining
> them?

Cool ! We could write rtree (and btree) ops using GiST. We already have a
realization of rtree for box ops and there is no problem writing
additional ops for points, polygons etc.

>
> > btw, I have a question about function gistPageAddItem in gist.c:
> > it just decompresses/compresses the key and calls PageAddItem to
> > write the tuple.  We don't understand why we need this function -
>
> The comment says
>
> ** Take a compressed entry, and install it on a page.  Since we now know
> ** where the entry will live, we decompress it and recompress it using
> ** that knowledge (some compression routines may want to fish around
> ** on the page, for example, or do something special for leaf nodes.)
>
> Are you prepared to say that you will no longer support the ability for
> GIST compression routines to do those things?  That seems shortsighted.
>

No-no !!! we don't intend to lose that (compression) functionality.

there are several reasons we want to eliminate gistPageAddItem:
1. It seems there are no examples where compress uses information about
   the page.
2. There is some discrepancy between the calculation of free space on a
   page and the size of the tuple saved on the page - the calculation of
   free space by gistNoSpace uses the compressed tuple, but the tuple
   itself is saved after recompression.  It's possible that the size of
   the tuple could change after recompression.
3. decompress/compress could slow down insert because it happens for
   every tuple.
4. Currently gistPageAddItem is broken because it's not toast-safe
   (see the call to gist_tuple_replacekey in gistPageAddItem).

Right now we use #define GIST_PAGEADDITEM in gist.c and
work with the original PageAddItem. If people insist on gistPageAddItem
we'll totally rewrite it. But for now we have enough work to do.


>             regards, tom lane
>
Regards,    Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83



Re: Plans for solving the VACUUM problem

From
Oleg Bartunov
Date:
On Fri, 18 May 2001, Tom Lane wrote:

> Oleg Bartunov <oleg@sai.msu.su> writes:
> >> The comment says
> >>
> >> ** Take a compressed entry, and install it on a page.  Since we now know
> >> ** where the entry will live, we decompress it and recompress it using
> >> ** that knowledge (some compression routines may want to fish around
> >> ** on the page, for example, or do something special for leaf nodes.)
> >>
> >> Are you prepared to say that you will no longer support the ability for
> >> GIST compression routines to do those things?  That seems shortsighted.
>
> > No-no !!! we don't intend to lose that (compression) functionality.
>
> > there are several reason we want to eliminate gistPageAddItem:
> > 1. It seems there are no examples where compress uses information about
> >    the page.
>
> We have none now, perhaps, but the original GIST authors seemed to think
> it would be a useful capability.  I'm hesitant to rip out functionality
> that they put in --- I don't think we understand GIST better than they
> did ;-)

ok. we'll save the code for the future. Probably we could ask the original
author - btw, who is the original author (Hellerstein, Aoki)?

>
> > 2. There is some discrepancy between calculation of free space on page and
> >    the size of tuple saved on page - calculation of free space on page
> >    by gistNoSpace uses compressed tuple but tuple itself saved after
> >    recompression. It's possible that the size of the tuple could change
> >    after recompression.
>
> Yes, that's a serious problem with the idea.  We'd have to suppose that
> recompression could not increase the size of the tuple --- or else be
> prepared to back up and find another page and do it again (ugh).
>

We have to keep this in mind when we return to gistPageAddItem.

> > 3. decompress/compress could slowdown insert  because it happens
> >    for every tuple.
>
> Seems like there should be a flag somewhere that tells whether the
> compression function actually cares about the page context or not.
> If not, you could skip the useless processing.
>

ok. see above.

> > 4. Currently gistPageAddItem is broken because it's not toast safe
> >    (see call gist_tuple_replacekey in gistPageAddItem)
>
> Not sure I see the problem.  gist_tuple_replacekey is kinda ugly, but
> what's it got to do with TOAST?
>

The tuple could be formed not by index_formtuple, which is the correct way.


>             regards, tom lane
>
Regards,    Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83



Re: Plans for solving the VACUUM problem

From
Oleg Bartunov
Date:
On Fri, 18 May 2001, Bruce Momjian wrote:

> > Oleg Bartunov <oleg@sai.msu.su> writes:
> > > On Thu, 17 May 2001, Tom Lane wrote:
> > >> We will also want to look at upgrading the non-btree index types to allow
> > >> concurrent operations.
> >
> > > am I right you plan to work with GiST indexes as well ?
> > > We read a paper "Concurrency and Recovery in Generalized Search Trees"
> > > by Marcel Kornacker, C. Mohan, Joseph Hellerstein
> > > (http://citeseer.nj.nec.com/kornacker97concurrency.html)
> > > and probably we could go in this direction. Right now we're working
> > > on adding of multi-key support to GiST.
> >
> > Yes, GIST should be upgraded to do concurrency.  But I have no objection
> > if you want to work on multi-key support first.
> >
> > My feeling is that a few releases from now we will have btree and GIST
> > as the preferred/well-supported index types.  Hash and rtree might go
> > away altogether --- AFAICS they don't do anything that's not done as
> > well or better by btree or GIST, so what's the point of maintaining
> > them?
>
> We clearly have too many index types, and we are actively telling people
> not to use hash.  And Oleg, don't you have a better version of GIST rtree
> than our native rtree?

We have an rtree implementation using GiST - it's incomplete, just boxes.
Look at http://www.sai.msu.su/~megera/postgres/gist/
We ported the old code from pg95 times, but it's not difficult to port the
remaining part. I'm not sure if our version is better - we didn't test
thoroughly - but it seems the memory requirements are lower and index
construction is much faster.

>
> I certainly like streamlining our stuff.
>


Regards,    Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83



Re: Plans for solving the VACUUM problem

From
Tom Lane
Date:
Oleg Bartunov <oleg@sai.msu.su> writes:
>> The comment says
>> 
>> ** Take a compressed entry, and install it on a page.  Since we now know
>> ** where the entry will live, we decompress it and recompress it using
>> ** that knowledge (some compression routines may want to fish around
>> ** on the page, for example, or do something special for leaf nodes.)
>> 
>> Are you prepared to say that you will no longer support the ability for
>> GIST compression routines to do those things?  That seems shortsighted.

> No-no !!! we don't intend to lose that (compression) functionality.

> there are several reasons we want to eliminate gistPageAddItem:
> 1. It seems there are no examples where compress uses information about
>    the page.

We have none now, perhaps, but the original GIST authors seemed to think
it would be a useful capability.  I'm hesitant to rip out functionality
that they put in --- I don't think we understand GIST better than they
did ;-)

> 2. There is some discrepancy between calculation of free space on page and
>    the size of tuple saved on page - calculation of free space on page
>    by gistNoSpace uses the compressed tuple, but the tuple itself is
>    saved after recompression. It's possible that the size of the tuple
>    could change after recompression.

Yes, that's a serious problem with the idea.  We'd have to suppose that
recompression could not increase the size of the tuple --- or else be
prepared to back up and find another page and do it again (ugh).

> 3. decompress/compress could slow down insert because it happens
>    for every tuple.

Seems like there should be a flag somewhere that tells whether the
compression function actually cares about the page context or not.
If not, you could skip the useless processing.
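
To make that concrete, a sketch in plain C - names invented, not the real
GiST interface - where the insert path pays for the decompress/recompress
round trip only when the opclass declares that its compress function looks
at the page:

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct Tuple { int payload; } Tuple;

    typedef struct OpclassInfo
    {
        bool compress_uses_page;    /* does compress() inspect the page? */
    } OpclassInfo;

    static Tuple decompress(Tuple t) { return t; }          /* stand-in */
    static Tuple compress_for_page(Tuple t, int pageno)     /* stand-in */
    {
        (void) pageno;      /* a real routine could fish around here */
        return t;
    }

    /* Pay for the round trip only when the opclass says it matters. */
    static Tuple prepare_for_page(const OpclassInfo *oc, Tuple t, int pageno)
    {
        if (!oc->compress_uses_page)
            return t;                           /* fast path */
        return compress_for_page(decompress(t), pageno);
    }

    int main(void)
    {
        OpclassInfo plain = { false };          /* page-independent opclass */
        Tuple t = { 42 };
        printf("%d\n", prepare_for_page(&plain, t, 7).payload);
        return 0;
    }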

> 4. Currently gistPageAddItem is broken because it's not toast safe
>    (see call gist_tuple_replacekey in gistPageAddItem)

Not sure I see the problem.  gist_tuple_replacekey is kinda ugly, but
what's it got to do with TOAST?
        regards, tom lane


RE: Plans for solving the VACUUM problem

From
"Mikheev, Vadim"
Date:
> I have been thinking about the problem of VACUUM and how we 
> might fix it for 7.2.  Vadim has suggested that we should
> attack this by implementing an overwriting storage manager
> and transaction UNDO, but I'm not totally comfortable with
> that approach: it seems to me that it's an awfully large
> change in the way Postgres works. 

I'm not sure if we should implement an overwriting smgr at all.
I was/am going to solve the space-reusing problem with the non-overwriting
one, though I'm sure that we have to reimplement it (> 1 table
per data file, on-disk FSM, etc).

> Second: if VACUUM can run in the background, then there's no
> reason not to run it fairly frequently. In fact, it could become
> an automatically scheduled activity like CHECKPOINT is now,
> or perhaps even a continuously running daemon (which was the
> original conception of it at Berkeley, BTW).

And the original authors concluded that the daemon was very slow in
reclaiming dead space, BTW.

> 3. Lazy VACUUM processes a table in five stages:
>    A. Scan relation looking for dead tuples;...
>    B. Remove index entries for the dead tuples...
>    C. Physically delete dead tuples and compact free space...
>    D. Truncate any completely-empty pages at relation's end.  
>    E. Create/update FSM entry for the table.
...
> If a tuple is dead, we care not whether its index entries are still
> around or not; so there's no risk to logical consistency.

What does this sentence mean? We canNOT remove a dead heap tuple until
we know that there are no index tuples referencing it, and your A, B, C
reflect this, so ..?

> Another place where lazy VACUUM may be unable to do its job completely
> is in compaction of space on individual disk pages.  It can physically
> move tuples to perform compaction only if there are not currently any
> other backends with pointers into that page (which can be tested by
> looking to see if the buffer reference count is one).  Again, we punt
> and leave the space to be compacted next time if we can't do it right
> away.

We could keep the shared buffer lock (or add some other kind of lock)
until the tuple is projected - after projection we don't need to read
the fetched tuple's data from the shared buffer, and the time between
fetching a tuple and projecting it is very short, so keeping the lock
on the buffer will not impact concurrency significantly.

Or we could register a callback cleanup function with the buffer, so
bufmgr would call it when the refcnt drops to 0.

> Presently, VACUUM deletes index tuples by doing a standard index
> scan and checking each returned index tuple to see if it points
> at any of the tuples to be deleted. If so, the index AM is called
> back to delete the tested index tuple. This is horribly inefficient:
...
> This is mainly a problem of a poorly chosen API. The index AMs
> should offer a "bulk delete" call, which is passed a sorted array
> of main-table TIDs. The loop over the index tuples should happen
> internally to the index AM.
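
(For illustration, a toy version of the quoted proposal in plain C -
invented names, not the real AM interface: one sequential pass over the
index, with an O(log n) bsearch against the sorted kill list per entry,
instead of one index scan per victim tuple.)

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { unsigned blkno, offnum; } Tid;  /* hypothetical TID */

    static int tid_cmp(const void *a, const void *b)
    {
        const Tid *x = a, *y = b;
        if (x->blkno != y->blkno) return x->blkno < y->blkno ? -1 : 1;
        if (x->offnum != y->offnum) return x->offnum < y->offnum ? -1 : 1;
        return 0;
    }

    /* Delete every index entry whose heap TID is in the sorted kill list;
       returns the new number of live entries (array compacted in place). */
    static size_t bulk_delete(Tid *entries, size_t nentries,
                              const Tid *killed, size_t nkilled)
    {
        size_t keep = 0;
        for (size_t i = 0; i < nentries; i++)
            if (!bsearch(&entries[i], killed, nkilled, sizeof(Tid), tid_cmp))
                entries[keep++] = entries[i];
        return keep;
    }

    int main(void)
    {
        Tid entries[] = { {1,1}, {1,2}, {2,1}, {3,5} };
        Tid killed[]  = { {1,2}, {3,5} };   /* sorted by (blkno, offnum) */
        printf("%zu live entries\n", bulk_delete(entries, 4, killed, 2));
        return 0;                           /* prints: 2 live entries */
    }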

I agree with others who think that the main problem of index cleanup
is reading all index data pages to remove some index tuples. You
yourself talked about partial heap scanning - so for each scanned part
of the table you'll have to read all index pages again and again - a
very good way to trash the buffer pool with big indices.

Well, probably it's OK for a first implementation and you'll win some CPU
with "bulk delete" - I'm not sure how much, though - and there is a more
significant issue with index cleanup if the table is not locked exclusively:
a concurrent index scan returns a tuple (and unlocks the index page),
heap_fetch reads the table row and finds that it's dead; now the index scan
*must* find the current index tuple to continue, but background vacuum
could already have removed that index tuple =>
elog(FATAL, "_bt_restscan: my bits moved...");

Two ways: hold the index page lock until the heap tuple is checked, or
(rough schema) store info in shmem (just IndexTupleData.t_tid and a flag)
that an index tuple is used by some scan, so the cleaner could change the
stored TID (get one from the previous index tuple) and set the flag to
help the scan restore its current position on return.

I'm particularly interested in discussing this issue because it must be
resolved for UNDO, and the chosen way will affect to what extent we'll be
able to implement dirty reads (the first way doesn't allow implementing
them in full - i.e. selects with joins - but is good enough to resolve
the RI constraints concurrency issue).

> There you have it.  If people like this, I'm prepared to commit to
> making it happen for 7.2.  Comments, objections, better ideas?

Well, my current TODO looks as (ORDER BY PRIORITY DESC):

1. UNDO;
2. New SMGR;
3. Space reusing.

and I cannot commit to anything about 3 at this point.  So why not
refine vacuum if you want it?  I personally was never able to convince
myself to spend time on this.

Vadim


Re: Plans for solving the VACUUM problem

From
Bruce Momjian
Date:
> Well, my current TODO looks as (ORDER BY PRIORITY DESC):
> 
> 1. UNDO;
> 2. New SMGR;
> 3. Space reusing.
> 
> and I cannot commit to anything about 3 at this point.  So why not
> refine vacuum if you want it?  I personally was never able to convince
> myself to spend time on this.

Vadim, can you remind me what UNDO is used for?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: Plans for solving the VACUUM problem

From
Tom Lane
Date:
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
>> If a tuple is dead, we care not whether its index entries are still
>> around or not; so there's no risk to logical consistency.

> What does this sentence mean? We canNOT remove a dead heap tuple until
> we know that there are no index tuples referencing it, and your A, B, C
> reflect this, so ..?

Sorry if it wasn't clear.  I meant that if the vacuum process fails
after removing an index tuple but before removing the (dead) heap tuple
it points to, there's no need to try to undo.  That state is OK, and
when we next get a chance to vacuum we'll still be able to finish
removing the heap tuple.

>> Another place where lazy VACUUM may be unable to do its job completely
>> is in compaction of space on individual disk pages.  It can physically
>> move tuples to perform compaction only if there are not currently any
>> other backends with pointers into that page (which can be tested by
>> looking to see if the buffer reference count is one).  Again, we punt
>> and leave the space to be compacted next time if we can't do it right
>> away.

> We could keep the shared buffer lock (or add some other kind of lock)
> until the tuple is projected - after projection we don't need to read
> the fetched tuple's data from the shared buffer, and the time between
> fetching a tuple and projecting it is very short, so keeping the lock
> on the buffer will not impact concurrency significantly.

Or drop the pin on the buffer to show we no longer have a pointer to it.
I'm not sure that the time to do projection is short though --- what
if there are arbitrary user-defined functions in the quals or the
projection targetlist?

> Or we could register a callback cleanup function with the buffer, so
> bufmgr would call it when the refcnt drops to 0.

Hmm ... might work.  There's no guarantee that the refcnt would drop to
zero before the current backend exits, however.  Perhaps set a flag in
the shared buffer header, and the last guy to drop his pin is supposed
to do the cleanup?  But then you'd be pushing VACUUM's work into
productive transactions, which is probably not the way to go.
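
A toy of that last-one-out idea, for illustration (single-threaded C; a
real version would do this under the buffer header's spinlock, and every
name here is invented):

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct BufferDesc
    {
        int  refcount;            /* pins held on this buffer */
        bool cleanup_pending;     /* vacuum couldn't compact; deferred */
    } BufferDesc;

    static void compact_page(BufferDesc *buf)
    {
        printf("compacting page...\n");
        buf->cleanup_pending = false;
    }

    static void unpin_buffer(BufferDesc *buf)
    {
        buf->refcount--;
        /* Last pin dropped: whoever it is pays for the deferred cleanup -
           which is exactly the cost-shifting objection above. */
        if (buf->refcount == 0 && buf->cleanup_pending)
            compact_page(buf);
    }

    int main(void)
    {
        BufferDesc buf = { .refcount = 2, .cleanup_pending = true };
        unpin_buffer(&buf);    /* refcount 1: nothing happens */
        unpin_buffer(&buf);    /* refcount 0: cleanup runs here */
        return 0;
    }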

>> This is mainly a problem of a poorly chosen API. The index AMs
>> should offer a "bulk delete" call, which is passed a sorted array
>> of main-table TIDs. The loop over the index tuples should happen
>> internally to the index AM.

> I agree with others who think that the main problem of index cleanup
> is reading all index data pages to remove some index tuples.

For very small numbers of tuples that might be true.  But I'm not
convinced it's worth worrying about.  If there aren't many tuples to
be freed, perhaps VACUUM shouldn't do anything at all.

> Well, probably it's OK for a first implementation and you'll win some CPU
> with "bulk delete" - I'm not sure how much, though - and there is a more
> significant issue with index cleanup if the table is not locked exclusively:
> a concurrent index scan returns a tuple (and unlocks the index page),
> heap_fetch reads the table row and finds that it's dead; now the index scan
> *must* find the current index tuple to continue, but background vacuum
> could already have removed that index tuple =>
> elog(FATAL, "_bt_restscan: my bits moved...");

Hm.  Good point ...

> Two ways: hold the index page lock until the heap tuple is checked, or
> (rough schema) store info in shmem (just IndexTupleData.t_tid and a flag)
> that an index tuple is used by some scan, so the cleaner could change the
> stored TID (get one from the previous index tuple) and set the flag to
> help the scan restore its current position on return.

Another way is to mark the index tuple "gone but not forgotten", so to
speak --- mark it dead without removing it.  (We could know that we need
to do that if we see someone else has a buffer pin on the index page.)
In this state, the index scan coming back to work would still be allowed
to find the index tuple, but no other index scan would stop on the
tuple.  Later passes of vacuum would eventually remove the index tuple,
whenever vacuum happened to pass through at an instant where no one has
a pin on that index page.
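
Roughly, in toy C form (made-up structures, not btree code): ordinary
scans skip dead-marked entries, while a suspended scan can still locate
the entry it was sitting on.

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct IndexEntry
    {
        int  key;
        bool dead;     /* "gone but not forgotten" */
    } IndexEntry;

    /* Ordinary scans never stop on a dead entry... */
    static const IndexEntry *scan_next(const IndexEntry *page, int n, int *pos)
    {
        while (*pos < n)
        {
            const IndexEntry *e = &page[(*pos)++];
            if (!e->dead)
                return e;
        }
        return NULL;
    }

    /* ...but a scan coming back to work may still find its entry by key. */
    static int restore_position(const IndexEntry *page, int n, int key)
    {
        for (int i = 0; i < n; i++)
            if (page[i].key == key)
                return i;
        return -1;
    }

    int main(void)
    {
        IndexEntry page[] = { {10, false}, {20, true}, {30, false} };
        int pos = 0;
        const IndexEntry *e;
        while ((e = scan_next(page, 3, &pos)))
            printf("scan sees %d\n", e->key);            /* 10, 30 */
        printf("restore finds slot %d\n",
               restore_position(page, 3, 20));           /* slot 1 */
        return 0;
    }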

None of these seem real clean though.  Needs more thought.

> Well, my current TODO looks as (ORDER BY PRIORITY DESC):

> 1. UNDO;
> 2. New SMGR;
> 3. Space reusing.

> and I cannot commit to anything about 3 at this point.  So why not
> refine vacuum if you want it?  I personally was never able to convince
> myself to spend time on this.

Okay, good.  I was worried that this idea would conflict with what you
were doing, but it seems it won't.
        regards, tom lane


RE: Plans for solving the VACUUM problem

From
"Mikheev, Vadim"
Date:
> Vadim, can you remind me what UNDO is used for?

Ok, last reminder -:))

On transaction abort, read WAL records and undo (rollback)
changes made in storage. Would allow:

1. Reclaim space allocated by aborted transactions.
2. Implement SAVEPOINTs.
   Just to remind -:) - in the event of an error discovered by the server
   - duplicate key, deadlock, command mistyping, etc. - the transaction
   will be rolled back to the nearest implicit savepoint set just before
   query execution; or the transaction can be rolled back by a
   ROLLBACK TO <savepoint_name> command to some explicit savepoint set by
   the user.  A transaction rolled back to a savepoint may be continued.
3. Reuse transaction IDs on postmaster restart.
4. Split pg_log into small files with the ability to remove old ones
   (which do not hold statuses for any running transactions).

Vadim


Re: Plans for solving the VACUUM problem

From
Tom Lane
Date:
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
>> Vadim, can you remind me what UNDO is used for?
> Ok, last reminder -:))

> On transaction abort, read WAL records and undo (rollback)
> changes made in storage. Would allow:

> 1. Reclaim space allocated by aborted transactions.
> 2. Implement SAVEPOINTs.
>    Just to remind -:) - in the event of an error discovered by the server
>    - duplicate key, deadlock, command mistyping, etc. - the transaction
>    will be rolled back to the nearest implicit savepoint set just before
>    query execution; or the transaction can be rolled back by a
>    ROLLBACK TO <savepoint_name> command to some explicit savepoint set by
>    the user.  A transaction rolled back to a savepoint may be continued.
> 3. Reuse transaction IDs on postmaster restart.
> 4. Split pg_log into small files with the ability to remove old ones
>    (which do not hold statuses for any running transactions).

Hm.  On the other hand, relying on WAL for undo means you cannot drop
old WAL segments that contain records for any open transaction.  We've
already seen several complaints that the WAL logs grow unmanageably huge
when there is a long-running transaction, and I think we'll see a lot
more.

It would be nicer if we could drop WAL records after a checkpoint or two,
even in the presence of long-running transactions.  We could do that if
we were only relying on them for crash recovery and not for UNDO.

Looking at the advantages:

1. Space reclamation via UNDO doesn't excite me a whole lot, if we can
make lightweight VACUUM work well.  (I definitely don't like the idea
that after a very long transaction fails and aborts, I'd have to wait
another very long time for UNDO to do its thing before I could get on
with my work.  Would much rather have the space reclamation happen in
background.)

2. SAVEPOINTs would be awfully nice to have, I agree.

3. Reusing xact IDs would be nice, but there's an answer with a lot less
impact on the system: go to 8-byte xact IDs.  Having to shut down the
postmaster when you approach the 4Gb transaction mark isn't going to
impress people who want a 24x7 commitment, anyway.

4. Recycling pg_log would be nice too, but we've already discussed other
hacks that might allow pg_log to be kept finite without depending on
UNDO (or requiring postmaster restarts, IIRC).

I'm sort of thinking that undoing back to a savepoint is the only real
usefulness of WAL-based UNDO.  Is it practical to preserve the WAL log
just back to the last savepoint in each xact, not the whole xact?

Another thought: do we need WAL UNDO at all to implement savepoints?
Is there some way we could do them like nested transactions, wherein
each savepoint-to-savepoint segment is given its own transaction number?
Committing multiple xact IDs at once might be a little tricky, but it
seems like a narrow, soluble problem.  Implementing UNDO without
creating lots of performance issues looks a lot harder.
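
The commit-at-once part might look roughly like this - a toy C sketch
with a one-byte-per-xid stand-in for pg_log and an assumed single commit
lock, all names invented:

    #include <pthread.h>
    #include <stdio.h>

    enum { XID_IN_PROGRESS = 0, XID_COMMITTED = 1, XID_ABORTED = 2 };

    static unsigned char xid_status[1024];   /* toy pg_log, 1 byte per xid */
    static pthread_mutex_t commit_lock = PTHREAD_MUTEX_INITIALIZER;

    /* All status updates happen under one lock, so a reader taking the
       same lock sees either none or all of the group committed. */
    static void commit_xid_group(const int *xids, int n)
    {
        pthread_mutex_lock(&commit_lock);
        for (int i = 0; i < n; i++)
            xid_status[xids[i]] = XID_COMMITTED;
        pthread_mutex_unlock(&commit_lock);
    }

    static int xid_is_committed(int xid)
    {
        pthread_mutex_lock(&commit_lock);
        int c = (xid_status[xid] == XID_COMMITTED);
        pthread_mutex_unlock(&commit_lock);
        return c;
    }

    int main(void)
    {
        int chain[] = { 100, 101, 102 };  /* main xact + two savepoints */
        commit_xid_group(chain, 3);
        printf("xid 101 committed: %d\n", xid_is_committed(101));
        return 0;
    }
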
        regards, tom lane


Re: Plans for solving the VACUUM problem

From
ncm@zembu.com (Nathan Myers)
Date:
On Fri, May 18, 2001 at 06:10:10PM -0700, Mikheev, Vadim wrote:
> > Vadim, can you remind me what UNDO is used for?
> 
> Ok, last reminder -:))
> 
> On transaction abort, read WAL records and undo (rollback)
> changes made in storage. Would allow:
> 
> 1. Reclaim space allocated by aborted transactions.
> 2. Implement SAVEPOINTs.
>    Just to remind -:) - in the event of an error discovered by the server
>    - duplicate key, deadlock, command mistyping, etc. - the transaction
>    will be rolled back to the nearest implicit savepoint set just before
>    query execution; or the transaction can be rolled back by a
>    ROLLBACK TO <savepoint_name> command to some explicit savepoint set by
>    the user.  A transaction rolled back to a savepoint may be continued.
> 3. Reuse transaction IDs on postmaster restart.
> 4. Split pg_log into small files with the ability to remove old ones
>    (which do not hold statuses for any running transactions).

I missed the original discussions; apologies if this has already been
beaten into the ground.  But... mightn't sub-transactions be a 
better-structured way to expose this service?

Nathan Myers
ncm@zembu.com


Re: Plans for solving the VACUUM problem

From
Bruce Momjian
Date:
> Another thought: do we need WAL UNDO at all to implement savepoints?
> Is there some way we could do them like nested transactions, wherein
> each savepoint-to-savepoint segment is given its own transaction number?
> Committing multiple xact IDs at once might be a little tricky, but it
> seems like a narrow, soluble problem.  Implementing UNDO without
> creating lots of performance issues looks a lot harder.

I am confused why we can't implement subtransactions as part of our
command counter?  The counter is already 4 bytes long.  Couldn't we
roll back to counter number X-10?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: Plans for solving the VACUUM problem

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I am confused why we can't implement subtransactions as part of our
> command counter?  The counter is already 4 bytes long.  Couldn't we
> roll back to counter number X-10?

That'd work within your own transaction, but not from outside it.
After you commit, how will other backends know which command-counter
values of your transaction to believe, and which not?
        regards, tom lane


Re: Plans for solving the VACUUM problem

From
Bruce Momjian
Date:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > I am confused why we can't implement subtransactions as part of our
> > command counter?  The counter is already 4 bytes long.  Couldn't we
> > roll back to counter number X-10?
> 
> That'd work within your own transaction, but not from outside it.
> After you commit, how will other backends know which command-counter
> values of your transaction to believe, and which not?

Seems we would have to store the command counters for the parts of the
transaction that committed, or the ones that were rolled back.  Yuck.

I hate to add UNDO complexity just for subtransactions.

Hey, I have an idea.  Can we do subtransactions as separate transactions
(as Tom mentioned), and put the subtransaction id's in the WAL, so they
can be safely committed/rolled back as a group?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: Plans for solving the VACUUM problem

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Hey, I have an idea.  Can we do subtransactions as separate transactions
> (as Tom mentioned), and put the subtransaction id's in the WAL, so they
> can be safely committed/rolled back as a group?

It's not quite that easy: all the subtransactions have to commit at
*the same time* from the point of view of other xacts, or you have
consistency problems.  So there'd need to be more xact-commit mechanism
than there is now.  Snapshots are also interesting; we couldn't use a
single xact ID per backend to show the open-transaction state.

WAL doesn't really enter into it AFAICS...
        regards, tom lane


Re: Plans for solving the VACUUM problem

From
Bruce Momjian
Date:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Hey, I have an idea.  Can we do subtransactions as separate transactions
> > (as Tom mentioned), and put the subtransaction id's in the WAL, so they
> > can be safely committed/rolled back as a group?
> 
> It's not quite that easy: all the subtransactions have to commit at
> *the same time* from the point of view of other xacts, or you have
> consistency problems.  So there'd need to be more xact-commit mechanism
> than there is now.  Snapshots are also interesting; we couldn't use a
> single xact ID per backend to show the open-transaction state.

Yes, I knew it was going to come up that you have to add a lock to
pg_log that is only in effect when someone is committing a transaction
with subtransactions.  Normal transactions get a read/shared lock,
while a subtransaction commit needs an exclusive/write lock.

Seems a lot easier than UNDO.  Vadim, you mentioned UNDO would allow
space reuse for rolled-back transactions, but in most cases the space
reuse is going to be for old copies of committed transactions, right?
Were you going to use WAL to get free space from old copies too?

Vadim, I think I am missing something.  You mentioned UNDO would be used
for these cases and I don't understand the purpose of adding what would
seem to be a pretty complex capability:

> 1. Reclaim space allocated by aborted transactions.

Is there really a lot to be saved here vs. old tuples of committed
transactions?

> 2. Implement SAVEPOINTs.
>    Just to remind -:) - in the event of an error discovered by the server
>    - duplicate key, deadlock, command mistyping, etc. - the transaction
>    will be rolled back to the nearest implicit savepoint set just before
>    query execution; or the transaction can be rolled back by a
>    ROLLBACK TO <savepoint_name> command to some explicit savepoint set by
>    the user.  A transaction rolled back to a savepoint may be continued.

Being discussed - perhaps using multiple transactions.

> 3. Reuse transaction IDs on postmaster restart.

Doesn't seem like a huge win.

> 4. Split pg_log into small files with the ability to remove old ones
>    (which do not hold statuses for any running transactions).

That one is interesting.  Seems the only workaround for that would be to
allow a global scan of all databases and tables to set commit flags,
then shrink pg_log and set XID offset as start of file.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: Plans for solving the VACUUM problem

From
Kaare Rasmussen
Date:
> Second: if VACUUM can run in the background, then there's no reason not
> to run it fairly frequently.  In fact, it could become an automatically
> scheduled activity like CHECKPOINT is now, or perhaps even a continuously
> running daemon (which was the original conception of it at Berkeley, BTW).

Maybe it's obvious, but I'd like to mention that you need some way of
setting priority. If it's a daemon, or a process, you can nice it. If not,
you need to implement something yourself.

-- 
Kaare Rasmussen            --Linux, spil,--        Tlf:        3816 2582
Kaki Data                tshirts, merchandize      Fax:        3816 2501
Howitzvej 75               Open 14.00-18.00        Web:      www.suse.dk
2000 Frederiksberg        Saturday 11.00-17.00     Email: kar@webline.dk


Re: Plans for solving the VACUUM problem

From
Bruce Momjian
Date:
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > Hey, I have an idea.  Can we do subtransactions as separate transactions
> > > (as Tom mentioned), and put the subtransaction id's in the WAL, so they
> > > can be safely committed/rolled back as a group?
> > 
> > It's not quite that easy: all the subtransactions have to commit at
> > *the same time* from the point of view of other xacts, or you have
> > consistency problems.  So there'd need to be more xact-commit mechanism
> > than there is now.  Snapshots are also interesting; we couldn't use a
> > single xact ID per backend to show the open-transaction state.
> 
> Yes, I knew it was going to come up that you have to add a lock to
> pg_log that is only in effect when someone is committing a transaction
> with subtransactions.  Normal transactions get a read/shared lock,
> while a subtransaction commit needs an exclusive/write lock.

I was wrong here.  Multiple backends can write to pg_log at the same
time, even subtransaction ones.  It is just that no backend can read from
pg_log during a subtransaction commit.  Actually, they can if they are
reading a transaction status that is less than the minimum active
transaction id; see GetXmaxRecent().

Doesn't seem too bad.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: Plans for solving the VACUUM problem

From
Bruce Momjian
Date:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Hey, I have an idea.  Can we do subtransactions as separate transactions
> > (as Tom mentioned), and put the subtransaction id's in the WAL, so they
> > can be safely committed/rolled back as a group?
> 
> It's not quite that easy: all the subtransactions have to commit at
> *the same time* from the point of view of other xacts, or you have
> consistency problems.  So there'd need to be more xact-commit mechanism
> than there is now.  Snapshots are also interesting; we couldn't use a
> single xact ID per backend to show the open-transaction state.

OK, I have another idea about subtransactions as multiple transaction
ids.

I realize that the snapshot problem would be an issue, because now
instead of looking at your own transaction id, you have to look at
multiple transaction ids.  We could do this as a List of xid's, but that
will not scale well.

My idea is for a subtransaction backend to have its own pg_log-style
memory area that shows which transactions it owns and has
committed/aborted.  It can have the log start at its start xid, and can
look in pg_log, or in this area anytime it needs to check the visibility
of a transaction greater than its minimum xid.  16k can hold 64k xids, so
it seems it should scale pretty well.  (Each xid is two bits in pg_log.)
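
For illustration, the map could look like this in plain C (hypothetical
names): two bits per xid, four statuses per byte, so a 16KB array covers
64k transactions starting at the backend's first xid.

    #include <stdio.h>

    enum { XS_IN_PROGRESS = 0, XS_COMMITTED = 1, XS_ABORTED = 2 };

    #define MAP_BYTES (16 * 1024)       /* 16k bytes -> 64k xids */

    typedef struct LocalXidMap
    {
        unsigned      start_xid;        /* first xid this backend owns */
        unsigned char bits[MAP_BYTES];  /* 4 two-bit statuses per byte */
    } LocalXidMap;

    static void xidmap_set(LocalXidMap *m, unsigned xid, unsigned status)
    {
        unsigned i = xid - m->start_xid;        /* 0 .. 65535 */
        unsigned shift = (i % 4) * 2;
        m->bits[i / 4] &= (unsigned char) ~(3u << shift);
        m->bits[i / 4] |= (unsigned char) ((status & 3u) << shift);
    }

    static unsigned xidmap_get(const LocalXidMap *m, unsigned xid)
    {
        unsigned i = xid - m->start_xid;
        return (m->bits[i / 4] >> ((i % 4) * 2)) & 3u;
    }

    int main(void)
    {
        static LocalXidMap m = { .start_xid = 5000 };
        xidmap_set(&m, 5002, XS_COMMITTED);
        xidmap_set(&m, 5003, XS_ABORTED);
        printf("%u %u %u\n", xidmap_get(&m, 5001),
               xidmap_get(&m, 5002), xidmap_get(&m, 5003));  /* 0 1 2 */
        return 0;
    }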

In fact, multi-query transactions are just a special case of
subtransactions, where all previous subtransactions are
committed/visible.  We could use the same pg_log-style memory area for
multi-query transactions, eliminating the command counter  and saving 8
bytes overhead per tuple.

Currently, the XMIN/XMAX command counters are used only by the current
transaction, and they are useless once the transaction finishes and take
up 8 bytes on disk.

So, this idea gets us subtransactions and saves 8 bytes overhead.  This
reduces our per-tuple overhead from 36 to 28 bytes, a 22% reduction!

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: Plans for solving the VACUUM problem

From
Tom Lane
Date:
Kaare Rasmussen <kar@webline.dk> writes:
>> Second: if VACUUM can run in the background, then there's no reason not
>> to run it fairly frequently.  In fact, it could become an automatically
>> scheduled activity like CHECKPOINT is now, or perhaps even a continuously
>> running daemon (which was the original conception of it at Berkeley, BTW).

> Maybe it's obvious, but I'd like to mention that you need some way of
> setting priority.

No, you don't --- or at least it's far less easy than it looks.  If any
one of the backends gets niced, then you have a priority inversion
problem.  That backend may be holding a lock that other backends want.
If it's not running because it's been niced, then the other backends
that are not niced are still kept from running.

Even if we wanted to implement priority inversion detection (painful
for locks, and completely impractical for spinlocks), there wouldn't
be anything we could do about it: Unix doesn't allow non-root processes
to reduce their nice setting.

Obviously any automatically-scheduled VACUUM would need some kind of
throttling mechanism to keep it from running constantly.  But that's not
a priority setting.
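
One shape such throttling could take - a sketch only, with invented names
and numbers: charge an accounting cost per page processed and nap whenever
the budget is spent, instead of nice'ing the process.

    #include <unistd.h>

    #define COST_PER_PAGE  1
    #define COST_LIMIT     200       /* budget between naps */
    #define NAP_USEC       10000     /* 10 ms */

    static int vacuum_cost = 0;

    /* Call once per page processed by the background vacuum. */
    static void vacuum_throttle(void)
    {
        vacuum_cost += COST_PER_PAGE;
        if (vacuum_cost >= COST_LIMIT)
        {
            usleep(NAP_USEC);        /* yield I/O and CPU to foreground */
            vacuum_cost = 0;
        }
    }

    int main(void)
    {
        for (int page = 0; page < 1000; page++)
            vacuum_throttle();       /* naps every COST_LIMIT pages */
        return 0;
    }
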
        regards, tom lane


Re: Plans for solving the VACUUM problem

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> In fact, multi-query transactions are just a special case of
> subtransactions, where all previous subtransactions are
> committed/visible.  We could use the same pg_log-style memory area for
> multi-query transactions, eliminating the command counter  and saving 8
> bytes overhead per tuple.

Interesting thought, but command IDs don't act the same as transactions;
in particular, visibility of one scan to another doesn't necessarily
depend on whether the scan has finished.

Possibly that could be taken into account by having different rules for
"do we think it's committed" in the local pg_log than the global one.

Also, this distinction would propagate out of the xact status code;
for example, it wouldn't do for heapam to set the "known committed"
bit on a tuple just because it's from a previous subtransaction of the
current xact.  Right now that works because heapam knows the difference
between xacts and commands; it would still have to know the difference.

A much more significant objection is that such a design would eat xact
IDs at a tremendous rate, to no purpose.  CommandCounterIncrement is a
cheap operation now, and we do it with abandon.  It would not be cheap
if it implied allocating a new xact ID that would eventually need to be
marked committed.  I don't mind allocating a new xact ID for each
explicitly-created savepoint, but a new ID per CommandCounterIncrement
is a different story.
        regards, tom lane


Re: Plans for solving the VACUUM problem

From
"Vadim Mikheev"
Date:
> Hm.  On the other hand, relying on WAL for undo means you cannot drop
> old WAL segments that contain records for any open transaction.  We've
> already seen several complaints that the WAL logs grow unmanageably huge
> when there is a long-running transaction, and I think we'll see a lot
> more.
> 
> It would be nicer if we could drop WAL records after a checkpoint or two,
> even in the presence of long-running transactions.  We could do that if
> we were only relying on them for crash recovery and not for UNDO.

As you understand, this is an old, well-known problem in database practice,
described in books. Two ways - either abort too-long-running transactions,
or (/and) compact old log segments: fetch and save (to use for undo) the
records of long-running transactions and remove the other records. Neither
way is perfect, but nothing is perfect at all -:)

> 1. Space reclamation via UNDO doesn't excite me a whole lot, if we can
> make lightweight VACUUM work well.  (I definitely don't like the idea

Sorry, but I'm going to consider background vacuum as a temporary solution
only. As I've already pointed out, the original PG authors finally became
disillusioned with the same approach. What is good about using UNDO for 1.
is the fact that WAL records give you *direct* physical access to the
changes which should be rolled back.

> that after a very long transaction fails and aborts, I'd have to wait
> another very long time for UNDO to do its thing before I could get on
> with my work.  Would much rather have the space reclamation happen in
> background.)

Understandable, but why should other transactions read dirty data again
and again while waiting for background vacuum? I think aborted
transactions should take some responsibility for the mess made by them -:)
And keeping 2. in mind, very long transactions could be continued -:)

> 2. SAVEPOINTs would be awfully nice to have, I agree.
> 
> 3. Reusing xact IDs would be nice, but there's an answer with a lot less
> impact on the system: go to 8-byte xact IDs.  Having to shut down the
> postmaster when you approach the 4Gb transaction mark isn't going to
> impress people who want a 24x7 commitment, anyway.

+8 bytes in the tuple header is not such a tiny thing.

> 4. Recycling pg_log would be nice too, but we've already discussed other
> hacks that might allow pg_log to be kept finite without depending on
> UNDO (or requiring postmaster restarts, IIRC).

We did... and didn't get agreement.

> I'm sort of thinking that undoing back to a savepoint is the only real
> usefulness of WAL-based UNDO. Is it practical to preserve the WAL log
> just back to the last savepoint in each xact, not the whole xact?

No, it's not. It's not possible in overwriting systems at all - all
transaction records are required.

> Another thought: do we need WAL UNDO at all to implement savepoints?
> Is there some way we could do them like nested transactions, wherein
> each savepoint-to-savepoint segment is given its own transaction number?
> Committing multiple xact IDs at once might be a little tricky, but it
> seems like a narrow, soluble problem.

Implicit savepoints wouldn't be possible - this is a very convenient
feature I've found in Oracle.
And additional code in tqual.c wouldn't be a good addition.

> Implementing UNDO without creating lots of performance issues looks
> a lot harder.

What *performance* issues?!
The only issue is additional disk requirements.

Vadim




Re: Plans for solving the VACUUM problem

From
Tom Lane
Date:
"Vadim Mikheev" <vmikheev@sectorbase.com> writes:
>> 1. Space reclamation via UNDO doesn't excite me a whole lot, if we can
>> make lightweight VACUUM work well.

> Sorry, but I'm going to consider background vacuum as a temporary solution
> only. As I've already pointed out, the original PG authors finally became
> disillusioned with the same approach.

How could they become disillusioned with it, when they never tried it?
I know of no evidence that any version of PG has had backgroundable
(non-blocking-to-other-transactions) VACUUM, still less within-relation
space recycling.  They may have become disillusioned with the form of
VACUUM that they actually had (ie, the same one we've inherited) --- but
please don't call that "the same approach" I'm proposing.

Certainly, doing VACUUM this way is an experiment that may fail, or may
require further work before it really works well.  But I'd appreciate it
if you wouldn't prejudge the results of the experiment.

>> Would much rather have the space reclamation happen in
>> background.)

> Understandable, but why should other transactions read dirty data again
> and again while waiting for background vacuum? I think aborted
> transactions should take some responsibility for the mess made by them -:)

They might read it again and again before the failed xact gets around to
removing the data, too.  You cannot rely on UNDO for correctness; at
most it can be a speed/space optimization.  I see no reason to assume
that it's a more effective optimization than a background vacuum
process.

>> 3. Reusing xact IDs would be nice, but there's an answer with a lot less
>> impact on the system: go to 8-byte xact IDs.

> +8 bytes in the tuple header is not such a tiny thing.

Agreed, but the people who need 8-byte IDs are not running small
installations.  I think they'd sooner pay a little more in disk space
than risk costs in performance or reliability.

>> Another thought: do we need WAL UNDO at all to implement savepoints?
>> Is there some way we could do them like nested transactions, wherein
>> each savepoint-to-savepoint segment is given its own transaction number?

> Implicit savepoints wouldn't be possible - this is a very convenient
> feature I've found in Oracle.

Why not?  Seems to me that establishing implicit savepoints is just a
user-interface issue; you can do it, or not do it, regardless of the
underlying mechanism.

>> Implementing UNDO without creating lots of performance issues looks
>> a lot harder.

> What *performance* issues?!
> The only issue is additional disk requirements.

Not so.  UNDO does failed-transaction cleanup work in the interactive
backends, where it necessarily delays clients who might otherwise be
issuing their next command.  A VACUUM-based approach does the cleanup
work in the background.  Same work, more or less, but it's not in the
clients' critical path.

BTW, UNDO for failed transactions alone will not eliminate the need for
VACUUM.  Will you also make successful transactions go back and
physically remove the tuples they deleted?
        regards, tom lane


Re: Plans for solving the VACUUM problem

From
"Vadim Mikheev"
Date:
> >> 1. Space reclamation via UNDO doesn't excite me a whole lot, if we can
> >> make lightweight VACUUM work well.
> 
> > Sorry, but I'm going to consider background vacuum as a temporary solution
> > only. As I've already pointed out, the original PG authors finally became
> > disillusioned with the same approach.
> 
> How could they become disillusioned with it, when they never tried it?
> I know of no evidence that any version of PG has had backgroundable
> (non-blocking-to-other-transactions) VACUUM, still less within-relation
> space recycling.  They may have become disillusioned with the form of
> VACUUM that they actually had (ie, the same one we've inherited) --- but
> please don't call that "the same approach" I'm proposing.

Pre-Postgres'95 (original) versions had a vacuum daemon running in the
background. I don't know if that vacuum shrank relations or not
(there was no shrinking in the '95 version); I know that the daemon had
to do some extra work in moving old tuples to archival storage, but
anyway, as you can read in the old papers, in the case of consistent
heavy load the daemon was not able to clean up storage fast enough. And
the reason is obvious - no matter how optimized your daemon is
(in regard to blocking other transactions etc.), it will have to
perform a huge amount of IO just to find space available for reclaiming.

> Certainly, doing VACUUM this way is an experiment that may fail, or may
> require further work before it really works well. But I'd appreciate it
> if you wouldn't prejudge the results of the experiment.

Why not, Tom? Why shouldn't I state my opinion?
Last summer your comment about WAL, my experiment at that time, was that
it would save just a few fsyncs. It was your right to make a prejudgment;
what's wrong with my rights? And you appealed to old papers as well, BTW.

> > Understandable, but why should other transactions read dirty data again
> > and again while waiting for background vacuum? I think aborted
> > transactions should take some responsibility for the mess made by them -:)
> 
> They might read it again and again before the failed xact gets around to
> removing the data, too.  You cannot rely on UNDO for correctness; at
> most it can be a speed/space optimization. I see no reason to assume
> that it's a more effective optimization than a background vacuum
> process.

Really?! Once again: WAL records give you the *physical* address of
tuples (both heap and index ones!) to be removed, and the size of the
log to read records from is not comparable with the size of the data files.

> >> Another thought: do we need WAL UNDO at all to implement savepoints?
> >> Is there some way we could do them like nested transactions, wherein
> >> each savepoint-to-savepoint segment is given its own transaction number?
> 
> > Implicit savepoints wouldn't be possible - this is a very convenient
> > feature I've found in Oracle.
> 
> Why not?  Seems to me that establishing implicit savepoints is just a
> user-interface issue; you can do it, or not do it, regardless of the
> underlying mechanism.

Implicit savepoints are set by the server automatically before each
query execution - you wouldn't use transaction IDs for this.

> >> Implementing UNDO without creating lots of performance issues looks
> >> a lot harder.
> 
> > What *performance* issues?!
> > The only issue is additional disk requirements.
> 
> Not so. UNDO does failed-transaction cleanup work in the interactive
> backends, where it necessarily delays clients who might otherwise be
> issuing their next command.  A VACUUM-based approach does the cleanup
> work in the background. Same work, more or less, but it's not in the
> clients' critical path.

Not the same work but much more, and in the critical paths of all clients.
And - is the overall performance of Oracle or Informix worse than PG's?
It seems delays in clients for rollback don't affect performance so much.
But dirty storage does.

> BTW, UNDO for failed transactions alone will not eliminate the need for
> VACUUM.  Will you also make successful transactions go back and
> physically remove the tuples they deleted?

They can't do this, as you know pretty well. But using WAL to get the
TIDs to be deleted is worth considering, no?

Vadim




Re: Plans for solving the VACUUM problem

From
"Vadim Mikheev"
Date:
> Were you going to use WAL to get free space from old copies too?

An approach worth considering.

> Vadim, I think I am missing something.  You mentioned UNDO would be used
> for these cases and I don't understand the purpose of adding what would
> seem to be a pretty complex capability:

Yeah, we already won the title of most advanced among the simple databases -:)
Yes, looking in a list of IDs assigned to a single transaction in tqual.c is
much easier to do than UNDO. Just as a couple of fsyncs is easier than WAL.

> > 1. Reclaim space allocated by aborted transactions.
> 
> Is there really a lot to be saved here vs. old tuples of committed
> transactions?

Are you able to protect COPY FROM from abort/crash?

Vadim




Re: Plans for solving the VACUUM problem

From
Tom Lane
Date:
"Vadim Mikheev" <vmikheev@sectorbase.com> writes:
> Really?! Once again: WAL records give you the *physical* address of
> tuples (both heap and index ones!) to be removed, and the size of the
> log to read records from is not comparable with the size of the data files.

You sure?  With our current approach of dumping data pages into the WAL
on first change since checkpoint (and doing so again after each
checkpoint) it's not too difficult to devise scenarios where the WAL log
is *larger* than the affected datafiles ... and can't be truncated until
someone commits.

The copied-data-page traffic is the worst problem with our current
WAL implementation.  I did some measurements last week on VACUUM of a
test table (the accounts table from a "pg_bench -s 10" setup, which
contains 1000000 rows; I updated 20000 rows and then vacuumed).  This
generated about 34400 8k blocks of WAL traffic, of which about 33300
represented copied pages and the other 1100 blocks were actual WAL
entries.  That's a pretty massive I/O overhead, considering the table
itself was under 20000 8k blocks.  It was also interesting to note that
a large fraction of the CPU time was spent calculating CRCs on the WAL
data.

Would it be possible to split the WAL traffic into two sets of files,
one for WAL log records proper and one for copied pages?  Seems like
we could recycle the pages after each checkpoint rather than hanging
onto them until the associated transactions commit.

>> Why not?  Seems to me that establishing implicit savepoints is just a
>> user-interface issue; you can do it, or not do it, regardless of the
>> underlying mechanism.

> Implicit savepoints are set by the server automatically before each
> query execution - you wouldn't use transaction IDs for this.

If the user asked you to, I don't see why not.
        regards, tom lane


Re: Plans for solving the VACUUM problem

From
The Hermit Hacker
Date:
On Sun, 20 May 2001, Vadim Mikheev wrote:

> > >> 1. Space reclamation via UNDO doesn't excite me a whole lot, if we can
> > >> make lightweight VACUUM work well.
> >
> > > Sorry, but I'm going to consider background vacuum as a temporary solution
> > > only. As I've already pointed out, the original PG authors finally became
> > > disillusioned with the same approach.
> >
> > How could they become disillusioned with it, when they never tried it?
> > I know of no evidence that any version of PG has had backgroundable
> > (non-blocking-to-other-transactions) VACUUM, still less within-relation
> > space recycling.  They may have become disillusioned with the form of
> > VACUUM that they actually had (ie, the same one we've inherited) --- but
> > please don't call that "the same approach" I'm proposing.
>
> Pre-Postgres'95 (original) versions had a vacuum daemon running in the
> background. I don't know if that vacuum shrank relations or not
> (there was no shrinking in the '95 version); I know that the daemon had
> to do some extra work in moving old tuples to archival storage, but
> anyway, as you can read in the old papers, in the case of consistent
> heavy load the daemon was not able to clean up storage fast enough. And
> the reason is obvious - no matter how optimized your daemon is
> (in regard to blocking other transactions etc.), it will have to
> perform a huge amount of IO just to find space available for reclaiming.
>
> > Certainly, doing VACUUM this way is an experiment that may fail, or may
> > require further work before it really works well. But I'd appreciate it
> > if you wouldn't prejudge the results of the experiment.
>
> Why not, Tom? Why shouldn't I state my opinion?
> Last summer your comment about WAL, my experiment at that time, was that
> it would save just a few fsyncs. It was your right to make a prejudgment;
> what's wrong with my rights? And you appealed to old papers as well, BTW.

If it's an "experiment", shouldn't it be done outside of the main source
tree, with adequate testing in a high load situation, with a patch
released to the community for further testing/comments, before it is added
to the source tree?  From reading Vadim's comment above (re:
pre-Postgres95), this daemonized approach would cause a high I/O load on
the server in a situation where there are *a lot* of UPDATE/DELETEs
happening to the database, which should be easily recreatable, no?  Or,
Vadim, am I misunderstanding?




Re: Plans for solving the VACUUM problem

From
Tom Lane
Date:
The Hermit Hacker <scrappy@hub.org> writes:
> If it's an "experiment", shouldn't it be done outside of the main source
> tree, with adequate testing in a high load situation, with a patch
> released to the community for further testing/comments, before it is added
> to the source tree?

Mebbe we should've handled WAL that way too ;-)

Seriously, I don't think that my proposed changes need be treated with
quite that much suspicion.  The only part that is really intrusive is
the shared-memory free-heap-space-management change.  But AFAICT that
will be a necessary component of *any* approach to getting rid of
VACUUM.  We've been arguing here, in essence, about whether a background
or on-line approach to finding free space will be more useful; but that
still leaves you with the question of what you do with the free space
after you've found it.  Without some kind of shared free space map,
there's not anything you can do except have the process that found the
space do tuple moving and file truncation --- ie, VACUUM.  So even if
I'm quite wrong about the effectiveness of a background VACUUM, the FSM
code will still be needed: an UNDO-style approach is also going to need
an FSM to do anything with the free space it finds.  It's equally clear
that the index AMs have to support index tuple deletion without
exclusive lock, or we'll still have blocking problems during free-space
cleanup, no matter what drives that cleanup.  The only part of what
I've proposed that might end up getting relegated to the scrap heap is
the "lazy vacuum" command itself, which will be a self-contained and
relatively small module (smaller than the present commands/vacuum.c,
for sure).

Besides which, Vadim has already said that he won't have time to do
anything about space reclamation before 7.2.  So even if background
vacuum does end up getting superseded by something better, we're going
to need it for a release or two ...
        regards, tom lane


Re: Plans for solving the VACUUM problem

From
Lincoln Yeoh
Date:
At 01:09 PM 20-05-2001 -0400, Tom Lane wrote:
>>> 3. Reusing xact IDs would be nice, but there's an answer with a lot less
>>> impact on the system: go to 8-byte xact IDs.
>
>> +8 bytes in the tuple header is not such a tiny thing.
>
>Agreed, but the people who need 8-byte IDs are not running small
>installations.  I think they'd sooner pay a little more in disk space
>than risk costs in performance or reliability.

An additional 4 (8?) bytes per tuple to increase the "mean time before
problem" 4 billion times sounds good to me.

Cheerio,
Link.



Re: Plans for solving the VACUUM problem

From
"Vadim Mikheev"
Date:
> > Really?! Once again: WAL records give you the *physical* address of
> > tuples (both heap and index ones!) to be removed, and the size of the
> > log to read records from is not comparable with the size of the data files.
> 
> You sure?  With our current approach of dumping data pages into the WAL
> on first change since checkpoint (and doing so again after each
> checkpoint) it's not too difficult to devise scenarios where the WAL log
> is *larger* than the affected datafiles ... and can't be truncated until
> someone commits.

Yes, but note my "size of log to read records from" - each log record
has a pointer to the previous record made by the same transaction:
rollback need not read the entire log file to get all the records of a
specific transaction.
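
For illustration, a toy of that backward chain in C (invented structures):
each record stores the position of the same transaction's previous record,
so undo visits only that transaction's records.

    #include <stdio.h>

    typedef struct WalRecord
    {
        int xid;
        int prev;           /* index of prev record of same xid, or -1 */
        const char *action;
    } WalRecord;

    static void undo_transaction(const WalRecord *log, int last)
    {
        for (int i = last; i != -1; i = log[i].prev)
            printf("undo xid %d: %s\n", log[i].xid, log[i].action);
    }

    int main(void)
    {
        /* records of xid 7 are chained: 0 <- 2 <- 4 */
        WalRecord log[] = {
            { 7, -1, "insert t1" },
            { 8, -1, "insert t9" },
            { 7,  0, "update t1" },
            { 8,  1, "delete t9" },
            { 7,  2, "insert t2" },
        };
        undo_transaction(log, 4);   /* visits records 4, 2, 0 only */
        return 0;
    }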

> >> Why not?  Seems to me that establishing implicit savepoints is just a
> >> user-interface issue; you can do it, or not do it, regardless of the
> >> underlying mechanism.
> 
> > Implicit savepoints are set by the server automatically before each
> > query execution - you wouldn't use transaction IDs for this.
> 
> If the user asked you to, I don't see why not.

An example of one implicit savepoint usage: skipping a duplicate key
insertion. Would you use transaction IDs when someone wants to insert a
few thousand records?

Vadim




Re: Plans for solving the VACUUM problem

From
"Vadim Mikheev"
Date:
> If it's an "experiment", shouldn't it be done outside of the main source
> tree, with adequate testing in a high load situation, with a patch
> released to the community for further testing/comments, before it is added
> to the source tree?  From reading Vadim's comment above (re:
> pre-Postgres95), this daemonized approach would cause a high I/O load on
> the server in a situation where there are *a lot* of UPDATE/DELETEs
> happening to the database, which should be easily recreatable, no?  Or,
> Vadim, am I misunderstanding?

It probably will not cause more IO than vacuum does right now.
But unfortunately it will not reduce that IO. Cleanup work will be spread
over time, and users will not experience long lockouts, but the average
impact on overall system throughput will be the same (or maybe higher).
My point is that we'll need dynamic cleanup anyway, and UNDO is
what should be implemented for dynamic cleanup of aborted changes.
Plus UNDO gives us a natural implementation of savepoints and some
abilities in transaction ID management, which we may use or not
(though 4. - pg_log size management - is a really good thing).

Vadim




Re: Plans for solving the VACUUM problem

From
"Vadim Mikheev"
Date:
> Seriously, I don't think that my proposed changes need be treated with
> quite that much suspicion.  The only part that is really intrusive is

Agreed. I fight for UNDO, not against background vacuum -:)

> the shared-memory free-heap-space-management change.  But AFAICT that
> will be a necessary component of *any* approach to getting rid of
> VACUUM.  We've been arguing here, in essence, about whether a background
> or on-line approach to finding free space will be more useful; but that
> still leaves you with the question of what you do with the free space
> after you've found it.  Without some kind of shared free space map,
> there's not anything you can do except have the process that found the
> space do tuple moving and file truncation --- ie, VACUUM.  So even if
> I'm quite wrong about the effectiveness of a background VACUUM, the FSM
> code will still be needed: an UNDO-style approach is also going to need
> an FSM to do anything with the free space it finds.  It's equally clear

Unfortunately, I think that we'll need an on-disk FSM, and that FSM is
actually the most complex thing to do in the "space reclamation" project.

> Besides which, Vadim has already said that he won't have time to do
> anything about space reclamation before 7.2.  So even if background
> vacuum does end up getting superseded by something better, we're going
> to need it for a release or two ...

Yes.

Vadim




Re: Plans for solving the VACUUM problem

From
Tom Lane
Date:
"Vadim Mikheev" <vmikheev@sectorbase.com> writes:
> Unfortunately, I think that we'll need an on-disk FSM, and that FSM is
> actually the most complex thing to do in the "space reclamation" project.

I hope we can avoid on-disk FSM.  Seems to me that that would create
problems both for performance (lots of extra disk I/O) and reliability
(what happens if FSM is corrupted?  A restart won't fix it).

But, if we do need it, most of the work needed to install FSM APIs
should carry over.  So I still don't see an objection to doing
in-memory FSM as a first step.
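
A minimal sketch of such an in-memory, lossy FSM in plain C (every name
invented): vacuum records (page, free bytes) pairs, inserters consume
them, and a lookup miss just means "extend the relation" - so losing the
map at restart is harmless.

    #include <stdio.h>

    #define FSM_SLOTS 8

    typedef struct FsmEntry { int page; int freebytes; } FsmEntry;
    typedef struct RelFsm { FsmEntry slot[FSM_SLOTS]; int n; } RelFsm;

    /* Called by vacuum when it finds free space on a page. */
    static void fsm_record(RelFsm *f, int page, int freebytes)
    {
        if (f->n < FSM_SLOTS)                 /* lossy: drop extras */
            f->slot[f->n++] = (FsmEntry){ page, freebytes };
    }

    /* Return a page with enough room, or -1 meaning "extend the relation". */
    static int fsm_get_page(RelFsm *f, int needed)
    {
        for (int i = 0; i < f->n; i++)
            if (f->slot[i].freebytes >= needed)
            {
                f->slot[i].freebytes -= needed;
                return f->slot[i].page;
            }
        return -1;
    }

    int main(void)
    {
        RelFsm f = { .n = 0 };
        fsm_record(&f, 42, 400);              /* vacuum found 400 free bytes */
        printf("page %d\n", fsm_get_page(&f, 120));    /* page 42 */
        printf("page %d\n", fsm_get_page(&f, 9000));   /* page -1: extend */
        return 0;
    }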


BTW, I was digging through the old Postgres papers this afternoon,
to refresh my memory about what they actually said about VACUUM.
I was interested to discover that at one time the tuple-insertion
algorithm went as follows:
 1. Pick a page at random in the relation, read it in, and see if it
    has enough free space.  Repeat up to three times.
 2. If #1 fails to find space, append tuple at end.

When they got around to doing some performance measurement, they
discovered that step #1 was a serious loser, and dropped it in favor
of pure #2 (which is what we still have today).  Food for thought.
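
In toy C form (illustration only), the heuristic as described:

    #include <stdio.h>
    #include <stdlib.h>

    /* Probe up to three random pages for room (step 1); if none has
       enough free space, append to a new page at the end (step 2). */
    static int find_insert_page(const int *freebytes, int npages, int needed)
    {
        for (int try = 0; try < 3; try++)
        {
            int p = rand() % npages;          /* random page probe */
            if (freebytes[p] >= needed)
                return p;
        }
        return npages;                        /* append at end */
    }

    int main(void)
    {
        int freebytes[4] = { 0, 0, 0, 512 };
        int p = find_insert_page(freebytes, 4, 100);
        printf("insert on page %d\n", p);     /* 3 if a probe hit it,
                                                 else 4 (new last page) */
        return 0;
    }
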
        regards, tom lane


RE: Plans for solving the VACUUM problem

From
"Henshall, Stuart - WCP"
Date:
Apologies if I've missed something, but isn't that the same xmin that ODBC
uses for row versioning?
- Stuart
<Snip>
> Currently, the XMIN/XMAX command counters are used only by the current
> transaction, and they are useless once the transaction finishes and take
> up 8 bytes on disk.
<Snip>



Re: Plans for solving the VACUUM problem

From
Tom Lane
Date:
"Vadim Mikheev" <vmikheev@sectorbase.com> writes:
> It probably will not cause more IO than vacuum does right now.
> But unfortunately it will not reduce that IO.

Uh ... what?  Certainly it will reduce the total cost of vacuum,
because it won't bother to try to move tuples to fill holes.
The index cleanup method I've proposed should be substantially
more efficient than the existing code, as well.

> My point is that we'll need dynamic cleanup anyway, and UNDO is
> what should be implemented for dynamic cleanup of aborted changes.

UNDO might offer some other benefits, but I doubt that it will allow
us to eliminate VACUUM completely.  To do that, you would need to
keep track of free space using exact, persistent (on-disk) bookkeeping
data structures.  The overhead of that will be very substantial: more,
I predict, than the approximate approach I proposed.
        regards, tom lane


Re: RE: Plans for solving the VACUUM problem

From
Hannu Krosing
Date:
"Henshall, Stuart - WCP" wrote:
> 
> Apologies if I've missed something, but isn't that the same xmin that ODBC
> uses for row versioning?
> - Stuart
> 
>         <Snip>
> > Currently, the XMIN/XMAX command counters are used only by the current
> > transaction, and they are useless once the transaction finishes and take
> > up 8 bytes on disk.
>         <Snip>

BTW, is there some place where I could read about the exact semantics of
the system fields?

--------------------
Hannu


RE: Plans for solving the VACUUM problem

From
"Mikheev, Vadim"
Date:
> > It probably will not cause more IO than vacuum does right now.
> > But unfortunately it will not reduce that IO.
> 
> Uh ... what?  Certainly it will reduce the total cost of vacuum,
> because it won't bother to try to move tuples to fill holes.

Oh, you're right here, but the daemon will most likely read data files
again and again with an in-memory FSM.  Also, if we do partial table
scans then we'll probably re-read indices more than once.

> The index cleanup method I've proposed should be substantially
> more efficient than the existing code, as well.

Not in the I/O area.

> > My point is that we'll need dynamic cleanup anyway, and UNDO is
> > what should be implemented for dynamic cleanup of aborted changes.
> 
> UNDO might offer some other benefits, but I doubt that it will allow
> us to eliminate VACUUM completely.  To do that, you would need to

I never said that -:)

> keep track of free space using exact, persistent (on-disk) bookkeeping
> data structures.  The overhead of that will be very substantial: more,
> I predict, than the approximate approach I proposed.

I doubt that the "big guys" use an in-memory FSM. If they were able to do this...

Vadim


RE: Plans for solving the VACUUM problem

From
"Mikheev, Vadim"
Date:
> I hope we can avoid on-disk FSM.  Seems to me that that would create
> problems both for performance (lots of extra disk I/O) and reliability
> (what happens if FSM is corrupted?  A restart won't fix it).

We can use WAL for FSM.

Vadim


RE: Plans for solving the VACUUM problem

From
"Mikheev, Vadim"
Date:
> > Really?! Once again: WAL records give you the *physical*
> > addresses of tuples (both heap and index ones!) to be
> > removed, and the size of the log to read records from is not
> > comparable with the size of the data files.
> 
> So how about a background "vacuum like" process that reads
> the WAL and does the cleanup?  Seems that would be great,
> since it then does not need to scan, and does not make
> foreground cleanup necessary.
> 
> The problem is when cleanup cannot keep up, and WAL files
> that already want to be removed pile up.  I would envision a
> config that says how many MB of WAL are allowed to queue
> up before clients are blocked.

Yes, some daemon could read the logs and gather cleanup info.
We could activate it when switching to a new log file segment,
and synchronization with the checkpointer is no big deal.  That
daemon could also archive log files for WAL-based BAR,
if archiving is ON.

But this will be useful only with on-disk FSM.
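A rough sketch of what such a daemon's per-segment pass might look
like; the record shape and every helper below are invented for
illustration, not real WAL internals:

#include <stdbool.h>
#include <stdint.h>

typedef struct WalRec {
    int      kind;      /* WAL_DELETE, WAL_UPDATE, ... */
    uint32_t relid;
    uint32_t block;     /* physical address: no table scan needed */
    uint16_t offset;
} WalRec;

enum { WAL_DELETE = 1, WAL_UPDATE = 2 };

extern bool wal_next_record(int segno, WalRec *rec);
extern void remember_dead_tid(uint32_t relid, uint32_t blk, uint16_t off);
extern void archive_segment(int segno);   /* for WAL-based BAR */

/* Run once per completed segment, e.g. at log-file switch time. */
static void scan_segment_for_cleanup(int segno, bool archiving)
{
    WalRec rec;

    while (wal_next_record(segno, &rec)) {
        /* Deletes, and the old halves of updates, are future dead
         * tuples; their physical addresses come straight from WAL,
         * so no data-file scan is needed to find them. */
        if (rec.kind == WAL_DELETE || rec.kind == WAL_UPDATE)
            remember_dead_tid(rec.relid, rec.block, rec.offset);
    }

    if (archiving)
        archive_segment(segno);
}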

Vadim


RE: Plans for solving the VACUUM problem

From
"Mikheev, Vadim"
Date:
> > We could keep the share buffer lock (or add some other kind of lock)
> > until the tuple is projected - after projection we need not read data
> > for the fetched tuple from the shared buffer, and the time between
> > fetching the tuple and projection is very short, so keeping a lock on
> > the buffer will not impact concurrency significantly.
> 
> Or drop the pin on the buffer to show we no longer have a pointer to it.

This is not good for seqscans, which will return to that buffer anyway.

> > Or we could register callback cleanup function with buffer so bufmgr
> > would call it when refcnt drops to 0.
> 
> Hmm ... might work.  There's no guarantee that the refcnt
> would drop to zero before the current backend exits, however.
> Perhaps set a flag in the shared buffer header, and the last guy
> to drop his pin is supposed to do the cleanup?

This is what I meant - set (register) a pointer in the buffer header
to a cleanup function.
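A sketch of that scheme under hypothetical names: the vacuuming
process registers the cleanup, and the last backend to drop its pin
runs it, so a page is never cleaned while anyone still holds a
pointer into it:

#include <stddef.h>

typedef struct BufferHdr {
    int    refcount;                      /* pins on this buffer */
    void (*cleanup)(struct BufferHdr *);  /* NULL if none pending */
} BufferHdr;

/* Vacuum side: the buffer is pinned by others, so defer the work.
 * Assume the buffer's spinlock is held around both functions. */
static void register_cleanup(BufferHdr *buf, void (*fn)(BufferHdr *))
{
    buf->cleanup = fn;
}

/* Backend side: the last process to drop its pin performs the
 * deferred cleanup. */
static void unpin_buffer(BufferHdr *buf)
{
    if (--buf->refcount == 0 && buf->cleanup != NULL) {
        void (*fn)(BufferHdr *) = buf->cleanup;

        buf->cleanup = NULL;
        fn(buf);
    }
}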

> But then you'd be pushing VACUUM's work into productive transactions,
> which is probably not the way to go.

Not much work - I wouldn't worry about it.

> > Two ways: hold the index page lock until the heap tuple is checked,
> > or (rough schema) store info in shmem (just IndexTupleData.t_tid
> > and a flag) that an index tuple is used by some scan, so the cleaner
> > could change the stored TID (get one from the prev index tuple) and
> > set the flag to help the scan restore its current position on return.
> 
> Another way is to mark the index tuple "gone but not forgotten", so to
> speak --- mark it dead without removing it. (We could know that we need
> to do that if we see someone else has a buffer pin on the index page.)

Register a cleanup function, just like with the heap above.

> None of these seem real clean though.  Needs more thought.
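For concreteness, a sketch of the "gone but not forgotten" variant,
with an invented flag and helpers: mark the entry dead in place when
someone else holds a pin, and let a later pass physically remove it:

#include <stdbool.h>
#include <stdint.h>

#define ITUP_DEAD 0x0001   /* hypothetical "dead but present" flag */

typedef struct IndexTup {
    uint16_t flags;
    /* ... key and heap TID would follow ... */
} IndexTup;

extern bool page_pinned_by_others(void *page);   /* invented */
extern void remove_index_tuple(void *page, IndexTup *itup);

static void kill_index_tuple(void *page, IndexTup *itup)
{
    if (page_pinned_by_others(page)) {
        /* Someone may be mid-scan on this page: flag the entry dead
         * without moving anything, so their position stays valid;
         * a later cleanup pass reclaims the space. */
        itup->flags |= ITUP_DEAD;
    } else {
        remove_index_tuple(page, itup);
    }
}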

Vadim


Re: Plans for solving the VACUUM problem

From
Tom Lane
Date:
Jan Wieck <JanWieck@yahoo.com> writes:
>     I think the in-shared-mem FSM could have  some  max-per-table
>     limit  and  the background VACUUM just skips the entire table
>     as long as nobody  reused  any  space.

I was toying with the notion of trying to use Vadim's "MNMB" idea
(see his description of the work he did for Perlstein last year);
that is, keep track of the lowest block number of any modified block
within each relation since the last VACUUM.  Then VACUUM would only
have to scan from there to the end.  This covers the totally-untouched-
relation case nicely, and also helps a lot for large rels that you're
mostly just adding to or perhaps updating recent additions.
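A sketch of the bookkeeping under hypothetical names: every path that
dirties a heap block lowers a per-relation watermark, and VACUUM scans
only from the watermark to the end of the relation:

#include <stdint.h>

/* Lowest block modified since the last VACUUM of this relation. */
typedef struct RelVacInfo {
    uint32_t min_modified_block;   /* UINT32_MAX => untouched */
} RelVacInfo;

/* Called from every path that dirties a heap block.  Note that
 * deletes must report here too, or their pages would be missed. */
static void note_block_modified(RelVacInfo *info, uint32_t blk)
{
    if (blk < info->min_modified_block)
        info->min_modified_block = blk;
}

extern uint32_t rel_nblocks(void *rel);                      /* invented */
extern void vacuum_block_range(void *rel, uint32_t from, uint32_t to);

static void lazy_vacuum_rel(void *rel, RelVacInfo *info)
{
    uint32_t start = info->min_modified_block;

    if (start == UINT32_MAX)
        return;                            /* untouched: skip entirely */

    info->min_modified_block = UINT32_MAX; /* reset for the next cycle */
    vacuum_block_range(rel, start, rel_nblocks(rel));
}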

The FSM could probably keep track of such info fairly easily, since
it will already be aware of which blocks it's told backends to try
to insert into.  But it would have to be told about deletes too,
which would mean more FSM access traffic and more lock contention.
Another problem (given my current view of how FSM should work) is that
rels not being used at all would not be in FSM, or would age out of it,
and so you wouldn't know that you didn't need to vacuum them.
So I'm not sure yet if it's a good idea.
        regards, tom lane


RE: Plans for solving the VACUUM problem

From
"Mikheev, Vadim"
Date:
> > My point is that we'll need dynamic cleanup anyway, and UNDO is
> > what should be implemented for dynamic cleanup of aborted changes.
> 
> I do not yet understand why you want to handle aborts different than
> outdated tuples.

Maybe because aborted tuples have a shorter time-to-live,
and the probability of finding their pages in the buffer pool is higher.

> The ratio in a well-tuned system should well favor outdated tuples.
> If someone ever adds "dirty read" it is also not guaranteed
> that nobody accesses the tuple you currently want
> to undo. So I really fail to see the big difference.

It will not be guaranteed anyway as soon as we start removing
tuples without exclusive access to the relation.

And, I cannot say that I would implement UNDO because of
1. (cleanup) OR 2. (savepoints) OR 4. (pg_log management)
but because of ALL of 1., 2., 4.

Vadim


RE: Plans for solving the VACUUM problem

From
"Mikheev, Vadim"
Date:
> From: Mikheev, Vadim 
> Sent: Monday, May 21, 2001 10:23 AM
> To: 'Jan Wieck'; Tom Lane
> Cc: The Hermit Hacker; 'Bruce Momjian';
> pgsql-hackers@postgresql.orgrg.us.greatbridge.com

Strange address, Jan?

> Subject: RE: [HACKERS] Plans for solving the VACUUM problem
> 
> 
> >     I think the in-shared-mem FSM could have  some  max-per-table
> >     limit  and  the background VACUUM just skips the entire table
> >     as long as nobody  reused  any  space.  Also  it  might  only
> >     compact pages that lead to 25 or more percent of freespace in
> >     the first place. That makes it more likely  that  if  someone
> >     looks  for  a place to store a tuple that it'll fit into that
> >     block (remember that the toaster tries to  keep  main  tuples
> >     below BLKSZ/4).
> 
> This should be a configurable parameter like PCFREE (or something
> like that) in Oracle: consider a page for insertion only if it's
> PCFREE % empty.
> 
> Vadim
> 
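A sketch of such a threshold check, treating PCFREE as nothing more
than an illustrative knob:

#include <stdbool.h>
#include <stdint.h>

static int fsm_pcfree = 25;   /* hypothetical setting, in percent */

/* Consider a page for insertion only if at least fsm_pcfree percent
 * of it is empty; anything fuller is not worth tracking. */
static bool page_worth_tracking(uint16_t free_bytes, uint16_t page_size)
{
    return (uint32_t) free_bytes * 100 >= (uint32_t) page_size * fsm_pcfree;
}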


RE: Plans for solving the VACUUM problem

From
"Mikheev, Vadim"
Date:
> > We could keep share buffer lock (or add some other kind of lock)
> > untill tuple projected - after projection we need not to read data
> > for fetched tuple from shared buffer and time between fetching
> > tuple and projection is very short, so keeping lock on buffer will
> > not impact concurrency significantly.
> 
> Or drop the pin on the buffer to show we no longer have a pointer
> to it. I'm not sure that the time to do projection is short though
> --- what if there are arbitrary user-defined functions in the quals
> or the projection targetlist?

Well, while we are on this subject, I finally should mention an issue
that has bothered me for a long time: only "simple" functions should be
allowed to deal with data in shared buffers directly. "Simple" means:
no SQL queries there. Why? One reason: we hold a shlock on the buffer
while doing a seqscan qual - what if the qual's SQL queries try to
acquire an exclock on the same buffer? Another reason - concurrency. I
think that such "heavy" functions should be provided with a copy of the
data.

Vadim


Re: Plans for solving the VACUUM problem

From
Tom Lane
Date:
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
>> I'm not sure that the time to do projection is short though
>> --- what if there are arbitrary user-defined functions in the quals
>> or the projection targetlist?

> Well, while we are on this subject, I finally should mention an issue
> that has bothered me for a long time: only "simple" functions should be
> allowed to deal with data in shared buffers directly. "Simple" means:
> no SQL queries there. Why? One reason: we hold a shlock on the buffer
> while doing a seqscan qual - what if the qual's SQL queries try to
> acquire an exclock on the same buffer?

I think we're there already: AFAICT, user-specified quals and
projections are done after dropping the buffer shlock.  (Yes, I know
there's a HeapKeyTest inside heapgettup, but user quals don't get
done there.)  We do still hold a pin, but that seems OK to me.
        regards, tom lane


Re: Plans for solving the VACUUM problem

From
Barry Lind
Date:

Tom Lane wrote:

> 
> Hm.  On the other hand, relying on WAL for undo means you cannot drop
> old WAL segments that contain records for any open transaction.  We've
> already seen several complaints that the WAL logs grow unmanageably huge
> when there is a long-running transaction, and I think we'll see a lot
> more.
> 
> It would be nicer if we could drop WAL records after a checkpoint or two,
> even in the presence of long-running transactions.  We could do that if
> we were only relying on them for crash recovery and not for UNDO.

In Oracle, REDO and UNDO are separate.  REDO in Oracle is limited in 
size to x log files y big (where x and y are parameters at database 
creation).  These x log files are reused in a round-robin way, with 
checkpoints forced when wrapping around.

UNDO in Oracle is done by something known as a 'rollback segment'.  
There are many rollback segments.  A transaction is assigned one 
rollback segment to write its UNDO records to.  Transactions are 
assigned to a rollback segment randomly (or you can explicitly assign a 
transaction to a particular rollback segment if you want).  The size of 
rollback segments is limited, and transactions are aborted if they 
exceed that size.  This is why Oracle has the option to assign a 
specific rollback segment to a transaction: if you know you are going 
to update/insert tons of stuff, you want to use an appropriately sized 
rollback segment for that transaction.

Things I like about the Oracle approach vs. the current WAL REDO and 
proposed UNDO:

1. Sizes allocated to REDO and UNDO in Oracle are known and 
configurable.  WAL sizes are unknown and not constrained.
2. Oracle allows for big transactions by letting them use 
specifically sized rollback segments to handle the large transaction.

While WAL could be enhanced to fix 1, it appears difficult to have 
different limits for different types of transactions as Oracle supports 
in 2.

Things I don't like about the Oracle approach:

1. Not only updates but also long-running queries can be aborted if the 
rollback segment size is too small, as the undo is necessary to create 
a snapshot of the state of the database at the time the query started.

thanks,
--Barry




Re: Plans for solving the VACUUM problem

From
Barry Lind
Date:

Mikheev, Vadim wrote:

> 
> Ok, last reminder -:))
> 
> On transaction abort, read WAL records and undo (rollback)
> changes made in storage. Would allow:
> 
> 1. Reclaim space allocated by aborted transactions.
> 2. Implement SAVEPOINTs.
>    Just to remind -:) - in the event of an error discovered by the server
>    - duplicate key, deadlock, command mistyping, etc. - the transaction
>    will be rolled back to the nearest implicit savepoint set
>    just before query execution; or the transaction can be rolled back by
>    a ROLLBACK TO <savepoint_name> command to some explicit savepoint
>    set by the user. A transaction rolled back to a savepoint may be continued.
> 3. Reuse transaction IDs on postmaster restart.
> 4. Split pg_log into small files with ability to remove old ones (which
>    do not hold statuses for any running transactions).
> 
> Vadim

This is probably not a good thread to add my two cents worth, but here 
goes anyway.

The biggest issue I see with the proposed UNDO using WAL is the issue of 
large/long-lasting transactions.  It might be possible to solve this 
problem with some extra work; however, keep in mind that different 
types of transactions (i.e. normal vs. bulk loads) require different 
amounts of time and/or UNDO.  To solve this problem, you really need 
per-transaction limits, which seems difficult to implement.

I have no doubt that UNDO with WAL can be done.  But is there some other 
way of doing UNDO that might be just as good or better?

Part of what I see in this thread, reading between the lines, is that some 
believe the solution to many problems in the long term is to implement 
an overwriting storage manager, and that implementing UNDO via WAL is a 
necessary step in that direction.  Others seem to believe that the 
non-overwriting storage manager has some life in it yet, and may even be 
the storage manager for many releases to come.  I don't know enough 
about the internals to have any say in that discussion; however, the 
grass isn't always greener on the other side of the fence (i.e. an 
overwriting storage manager will come with its own set of problems/issues).

So let me throw out an idea for UNDO using the current storage manager.  
First let me state that UNDO is a bit of a misnomer, since undo for 
transactions is already implemented.  That is what pg_log is all about.  
The part of UNDO that is missing is savepoints (either explicit or 
implicit), because pg_log doesn't capture the information for each 
command in a transaction.  So the question really becomes, how to 
implement savepoints with current storage manager?

I am going to lay out one assumption that I am making:

1) Most transactions are either completely successful or completely 
rolled back.  (If this weren't true, i.e. you really needed savepoints 
to partially roll back changes, you couldn't be using the existing 
version of postgresql at all.)

My proposal is:

1) Create a new relation to store 'failed commands' for transactions.  
This is similar to pg_log for transactions, but takes it to the 
command level.  And since it records only failed commands (or ranges 
of failed commands), most transactions will not have any information 
in this relation, per the assumption above.

2) Use the unused pg_log status (3 = unused, 2 = commit, 1 = abort, 
0 = in process) to mean that the transaction was committed but some 
commands were rolled back (i.e. partial commit).  Again, for the 
majority of transactions nothing will need to change, since they will 
still be marked as committed or aborted.

3) Code that determines whether or not a tuple is committed needs to 
be aware of this new pg_log status, and look in the new relation to 
see if the particular command was rolled back or not to determine the 
committed status of the tuple.  This subtly changes the meaning of 
HEAP_XMIN_COMMITTED and related flags to reflect the transaction and 
command status instead of just the transaction status.  (See the 
sketch below.)

The runtime cost of this shouldn't be too high, since the committed 
state is cached in HEAP_XMIN_COMMITTED et al.; it is only the added cost 
for the pass that needs to set these flags, and then there is only added 
cost in the case that the transaction wasn't completely successful 
(again, my assumption above).
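A sketch of the visibility check under this proposal, with invented
helper names; the fourth pg_log status means "committed, but consult
the failed-commands relation before trusting any individual command":

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Xid;
typedef uint32_t Cid;

/* pg_log statuses: 0 = in process, 1 = abort, 2 = commit,
 * 3 (formerly unused) = partial commit under this proposal. */
enum { XS_INPROCESS = 0, XS_ABORT = 1, XS_COMMIT = 2, XS_PARTIAL = 3 };

extern int  pg_log_status(Xid xid);                     /* invented */
extern bool command_was_rolled_back(Xid xid, Cid cid);  /* checks the
                                                         * new relation */

static bool tuple_insert_committed(Xid xmin, Cid cmin)
{
    switch (pg_log_status(xmin)) {
        case XS_COMMIT:
            return true;
        case XS_PARTIAL:
            /* Committed overall, but some commands were rolled back;
             * per the assumption above this path is rare. */
            return !command_was_rolled_back(xmin, cmin);
        default:
            return false;   /* aborted or still in process */
    }
}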

Now I have no idea if what I am proposing is really doable or not.  I 
am just throwing this out as an alternative to WAL-based 
UNDO/savepoints.  The reason I am doing this is that to me it seems to 
leverage much of the existing infrastructure already in place that 
performs undo for rolled-back transactions (all the tmin, tmax, cmin, 
cmax stuff, as well as vacuum).  Also it doesn't come with the large WAL 
log file problem for large transactions.

Now having said all of this I realize that this doesn't solve the 4 
billion transaction id limit problem, or the large size of the pg_log 
file with large numbers of transactions.

thanks,
--Barry





Re: RE: Plans for solving the VACUUM problem

From
Bruce Momjian
Date:
You can read my Internals paper at the bottom of developers corner.

> "Henshall, Stuart - WCP" wrote:
> > 
> > Apologies if I've missed something, but isn't that the same xmin that ODBC
> > uses for row versioning?
> > - Stuart
> > 
> >         <Snip>
> > > Currently, the XMIN/XMAX command counters are used only by the current
> > > transaction, and they are useless once the transaction finishes and take
> > > up 8 bytes on disk.
> >         <Snip>
> 
> BTW, is there some place where I could read about the exact semantics of
> the system fields?
> 
> --------------------
> Hannu
> 



XMAX weirdness (was: Plans for solving the VACUUM problem)

From
Hannu Krosing
Date:
Bruce Momjian wrote:
> 
> You can read my Internals paper at the bottom of developers corner.
> 

in answer to my question:

> > BTW, is there some place where I could read about the exact semantics of
> > the system fields?

the only thing I found was on page 50 which claimed that

xmax - destruction transaction id

which is what I remembered, 

but then no xmax should ever be visible in a regular query:


hannu=# create table parent(pid int primary key);
NOTICE:  CREATE TABLE/PRIMARY KEY will create implicit index
'parent_pkey' for table 'parent'
CREATE
hannu=# create table child(cid int references parent(pid) on update
cascade);
NOTICE:  CREATE TABLE will create implicit trigger(s) for FOREIGN KEY
check(s)
CREATE
hannu=# insert into parent values(1);
INSERT 27525 1
hannu=# insert into child values(1);
INSERT 27526 1
hannu=# update parent set pid=2;
UPDATE 1
hannu=# select xmin,xmax,* from parent;
 xmin | xmax | pid 
------+------+-----
  688 |  688 |   2
(1 row)

------------------------
Hannu


Re: Plans for solving the VACUUM problem

From
Bruce Momjian
Date:
> > > My point is that we'll need in dynamic cleanup anyway and UNDO is
> > > what should be implemented for dynamic cleanup of aborted changes.
> > 
> > I do not yet understand why you want to handle aborts different than
> > outdated tuples.
> 
> Maybe because aborted tuples have a shorter time-to-live,
> and the probability of finding their pages in the buffer pool is higher.

This brings up an idea I had about auto-vacuum.  I wonder if autovacuum
could do most of its work by looking at the buffer cache pages and
commit xids.  It seems it would be quite easy to record free space in pages
already in the buffer and collect that information for other backends to
use.  It could also move tuples between cache pages with little
overhead.

There wouldn't be any I/O overhead, and frequently used tables are
already in the cache.
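A sketch of that collection pass, with invented names: walk the shared
buffer pool and record free space only for pages that are already
resident, so no disk reads are issued at all:

#include <stdint.h>
#include <stddef.h>

typedef struct CachedPage {
    uint32_t relid;
    uint32_t block;
    uint16_t free_bytes;
} CachedPage;

extern int         buffer_pool_size(void);     /* invented helpers */
extern CachedPage *buffer_pool_page(int i);    /* NULL if slot empty */
extern void        fsm_record(uint32_t relid, uint32_t blk, uint16_t free);

static void collect_freespace_from_cache(void)
{
    for (int i = 0; i < buffer_pool_size(); i++) {
        CachedPage *p = buffer_pool_page(i);

        /* Only pages already in shared memory are examined, so this
         * costs no I/O; cold tables are simply not seen here. */
        if (p != NULL && p->free_bytes > 0)
            fsm_record(p->relid, p->block, p->free_bytes);
    }
}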



Re: Plans for solving the VACUUM problem

From
Bruce Momjian
Date:
> > The ratio in a well-tuned system should well favor outdated tuples.
> > If someone ever adds "dirty read" it is also not guaranteed
> > that nobody accesses the tuple you currently want
> > to undo. So I really fail to see the big difference.
> 
> It will not be guaranteed anyway as soon as we start removing
> tuples without exclusive access to the relation.
> 
> And, I cannot say that I would implement UNDO because of
> 1. (cleanup) OR 2. (savepoints) OR 4. (pg_log management)
> but because of ALL of 1., 2., 4.

OK, I understand your reasoning here, but I want to make a comment.

Looking at the previous features you added, like subqueries, MVCC, or
WAL, these were major features that greatly enhanced the system's
capabilities.

Now, looking at UNDO, I just don't see it in the same league as those
other additions.  Of course, you can work on whatever you want, but I
was hoping to see another major feature addition for 7.2.  We know we
badly need auto-vacuum, improved replication, and point-in-time recovery.

I can see UNDO improving row reuse, and making subtransactions and
pg_log compression easier, but these items do not require UNDO.  

In fact, I am unsure why we would want an UNDO way of reusing rows of
aborted transactions and an autovacuum way of reusing rows from
committed transactions, especially because aborted transactions account
for <5% of all transactions.  It would be better to put work into one
mechanism that would reuse all tuples.

If UNDO came with no limitations, it may be a good option, but the need
to carry tuples until transaction commit does add an extra burden on
programmers and administrators, and I just don't see what we are getting
for it.



RE: Plans for solving the VACUUM problem

From
"Mikheev, Vadim"
Date:
> > And, I cannot say that I would implement UNDO because of
> > 1. (cleanup) OR 2. (savepoints) OR 4. (pg_log management)
> > but because of ALL of 1., 2., 4.
> 
> OK, I understand your reasoning here, but I want to make a comment.
> 
> Looking at the previous features you added, like subqueries, MVCC, or
> WAL, these were major features that greatly enhanced the system's
> capabilities.
> 
> Now, looking at UNDO, I just don't see it in the same league as those
> other additions.  Of course, you can work on whatever you want, but I
> was hoping to see another major feature addition for 7.2.  We know we
> badly need auto-vacuum, improved replication, and point-in-time recovery.

I don't like the auto-vacuum approach in the long term, WAL-based BAR is
too easy to do -:) (and you know that there is a man who will do it,
probably), bidirectional sync replication is good to work on, but I'm more
interested in storage/transaction management now. And I'm not sure
if I'll have enough time for "another major feature in 7.2" anyway.

> It would be better to put work into one mechanism that would
> reuse all tuples.

This is what we're discussing now -:)
If the community does not like UNDO then I'll probably try to implement
a dead space collector which will read log files and so on. It would be
easy to #ifdef it in 7.2 to use in 7.3 (or so) with on-disk FSM. Also, I
have to implement logging for non-btree indices (anyway required for
UNDO, WAL-based BAR, WAL-based space reusing).

Vadim


Re: Plans for solving the VACUUM problem

From
Hiroshi Inoue
Date:
Bruce Momjian wrote:
> 
> > > The ratio in a well-tuned system should well favor outdated tuples.
> > > If someone ever adds "dirty read" it is also not guaranteed
> > > that nobody accesses the tuple you currently want
> > > to undo. So I really fail to see the big difference.
> >
> > It will not be guaranteed anyway as soon as we start removing
> > tuples without exclusive access to the relation.
> >
> > And, I cannot say that I would implement UNDO because of
> > 1. (cleanup) OR 2. (savepoints) OR 4. (pg_log management)
> > but because of ALL of 1., 2., 4.
> 
> OK, I understand your reasoning here, but I want to make a comment.
> 
> Looking at the previous features you added, like subqueries, MVCC, or
> WAL, these were major features that greatly enhanced the system's
> capabilities.
> 
> Now, looking at UNDO, I just don't see it in the same league as those
> other additions. 

Hmm, hasn't it been agreed already? I know UNDO was planned
for 7.0 and I've never heard objections about it until
recently. People also have referred to an overwriting smgr
easily. Please tell me how to introduce an overwriting smgr
without UNDO.

regards,
Hiroshi Inoue


Re: Plans for solving the VACUUM problem

From
Bruce Momjian
Date:
> > Looking at the previous features you added, like subqueries, MVCC, or
> > WAL, these were major features that greatly enhanced the system's
> > capabilities.
> > 
> > Now, looking at UNDO, I just don't see it in the same league as those
> > other additions. 
> 
> Hmm, hasn't it been agreed already? I know UNDO was planned
> for 7.0 and I've never heard objections about it until
> recently. People also have referred to an overwriting smgr
> easily. Please tell me how to introduce an overwriting smgr
> without UNDO.

I guess that is the question.  Are we heading for an overwriting storage
manager?  I didn't see that in Vadim's list of UNDO advantages, but
maybe that is his final goal.  If so, UNDO may make sense, but then the
question is how do we keep MVCC with an overwriting storage manager?

The only way I can see doing it is to throw the old tuples into the WAL
and have backends read through that for MVCC info.



Re: Plans for solving the VACUUM problem

From
Hiroshi Inoue
Date:

Bruce Momjian wrote:
> 
> > > Looking at the previous features you added, like subqueries, MVCC, or
> > > WAL, these were major features that greatly enhanced the system's
> > > capabilities.
> > >
> > > Now, looking at UNDO, I just don't see it in the same league as those
> > > other additions.
> >
> > Hmm, hasn't it been agreed already? I know UNDO was planned
> > for 7.0 and I've never heard objections about it until
> > recently. People also have referred to an overwriting smgr
> > easily. Please tell me how to introduce an overwriting smgr
> > without UNDO.
> 
> I guess that is the question.  Are we heading for an overwriting storage
> manager?

I've never heard that it was given up, so there seems to be
at least a possibility of introducing it in the future.
PostgreSQL has been able to live without UNDO thanks to its
no-overwrite smgr. I don't know if it's possible to implement
partial rollback while avoiding UNDO (I don't think it's easy
anyway), but such a workaround seems harmful for the future
implementation of an overwriting smgr if we were to
introduce one.

> I didn't see that in Vadim's list of UNDO advantages, but
> maybe that is his final goal.
> If so UNDO may make sense, but then the
> question is how do we keep MVCC with an overwriting storage manager?
> 

It doesn't seem easy. ISTM it's one of the main reasons we
couldn't introduce an overwriting smgr in 7.2.

regards,
Hiroshi Inoue


RE: Plans for solving the VACUUM problem

From
Philip Warner
Date:
At 14:33 22/05/01 -0700, Mikheev, Vadim wrote:
>
>If the community does not like UNDO then I'll probably try to implement
>a dead space collector which will read log files and so on.

I'd vote for UNDO; in terms of usability & friendliness it's a big win.
Tom's plans for FSM etc. are, at least, going to get us some useful data,
and at best will mean we can hold off on a WAL-based FSM for a few versions.






Re: Plans for solving the VACUUM problem

From
Hannu Krosing
Date:
Hiroshi Inoue wrote:
> 
> People also have referred to an overwriting smgr easily.

I am all for an overwriting smgr, but as a feature that can be selected 
on a table-by-table basis (or at least at compile time), not as an
overall change.

> Please tell me how to introduce an overwriting smgr
> without UNDO.

I would much rather have a dead-space-reusing smgr on top of MVCC which
does not touch live transactions.

------------------
Hannu


Re: XMAX weirdness (was: Plans for solving the VACUUM problem)

From
Tom Lane
Date:
Hannu Krosing <hannu@tm.ee> writes:
> but then no xmax should ever be visible in a regular query:

Not so.  For example, if a transaction tried to delete a tuple, but is
either still open or rolled back, then other transactions would see its
XID in the tuple's xmax.  Also, SELECT FOR UPDATE uses xmax to record
the XID of the transaction that has the tuple locked --- that's the
case you are seeing, because of the SELECT FOR UPDATE done by the
foreign-key triggers on the table.

There is a claim in the current documentation that xmax is never nonzero
in a visible tuple, but that's incorrect ...
        regards, tom lane


Re: Plans for solving the VACUUM problem

From
Tom Lane
Date:
Hiroshi Inoue <Inoue@tpf.co.jp> writes:
>> I guess that is the question.  Are we heading for an overwriting storage
>> manager?

> I've never heard that it was given up. So there seems to be
> at least a possibility to introduce it in the future.

Unless we want to abandon MVCC (which I don't), I think an overwriting
smgr is impractical.  We need a more complex space-reuse scheme than
that.
        regards, tom lane


Re: Plans for solving the VACUUM problem

From
Hiroshi Inoue
Date:
Tom Lane wrote:
> 
> Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> >> I guess that is the question.  Are we heading for an overwriting storage
> >> manager?
> 
> > I've never heard that it was given up. So there seems to be
> > at least a possibility to introduce it in the future.
> 
> Unless we want to abandon MVCC (which I don't), I think an overwriting
> smgr is impractical. 

Impractical ? Oracle does it.

> We need a more complex space-reuse scheme than
> that.
> 

IMHO we have to decide which way to go now.
As I already mentioned, changing the current handling
of transactionId/CommandId to avoid UNDO is not
only useless but also harmful for an overwriting
smgr.

regards,
Hiroshi Inoue


Re: Plans for solving the VACUUM problem

From
Don Baccus
Date:
At 08:15 AM 5/24/01 +0900, Hiroshi Inoue wrote:

>> Unless we want to abandon MVCC (which I don't), I think an overwriting
>> smgr is impractical. 
>
>Impractical ? Oracle does it.

It's not easy, though ... the current PG scheme has the advantage of being
relatively simple and probably more efficient than scanning logs like
Oracle has to do (assuming your datafiles aren't thoroughly clogged with
old dead tuples).

Has anyone looked at InterBase for hints for space-reusing strategies? 

As I understand it, they have a tuple-versioning scheme similar to PG's.

If nothing else, something might be learned as to the efficiency and
effectiveness of one particular approach to solving the problem.





Re: Plans for solving the VACUUM problem

From
Hiroshi Inoue
Date:
Don Baccus wrote:
> 
> At 08:15 AM 5/24/01 +0900, Hiroshi Inoue wrote:
> 
> >> Unless we want to abandon MVCC (which I don't), I think an overwriting
> >> smgr is impractical.
> >
> >Impractical ? Oracle does it.
> 
> It's not easy, though ... the current PG scheme has the advantage of being
> relatively simple and probably more efficient than scanning logs like
> Oracle has to do (assuming your datafiles aren't thoroughly clogged with
> old dead tuples).
> 

I think so too. I've never said that an overwriting smgr
is easy, and I don't love it particularly.

What I'm objecting to is avoiding UNDO without giving up
an overwriting smgr. We shouldn't be noncommittal now.

regards,
Hiroshi Inoue


Re: Plans for solving the VACUUM problem

From
Tom Lane
Date:
Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> Tom Lane wrote:
>> Unless we want to abandon MVCC (which I don't), I think an overwriting
>> smgr is impractical. 

> Impractical ? Oracle does it.

Oracle has MVCC?
        regards, tom lane


Re: Plans for solving the VACUUM problem

From
Hiroshi Inoue
Date:
Tom Lane wrote:
> 
> Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> > Tom Lane wrote:
> >> Unless we want to abandon MVCC (which I don't), I think an overwriting
> >> smgr is impractical.
> 
> > Impractical ? Oracle does it.
> 
> Oracle has MVCC?
> 

Yes.

regards,
Hiroshi Inoue


Re: Plans for solving the VACUUM problem

From
Don Baccus
Date:
At 11:02 PM 5/23/01 -0400, Tom Lane wrote:
>Hiroshi Inoue <Inoue@tpf.co.jp> writes:
>> Tom Lane wrote:
>>> Unless we want to abandon MVCC (which I don't), I think an overwriting
>>> smgr is impractical. 
>
>> Impractical ? Oracle does it.
>
>Oracle has MVCC?

With restrictions, yes.  You didn't know that?  Vadim did ...





RE: Plans for solving the VACUUM problem

From
"Mikheev, Vadim"
Date:
> >> Impractical ? Oracle does it.
> >
> >Oracle has MVCC?
> 
> With restrictions, yes.

What restrictions? Rollback segments size?
Non-overwriting smgr can eat all disk space...

> You didn't know that?  Vadim did ...

Didn't I mention a few times that I was
inspired by Oracle? -:)

Vadim


RE: Plans for solving the VACUUM problem

From
"Mikheev, Vadim"
Date:
> If PostgreSQL wants to stay MVCC, then we should imho forget
> "overwriting smgr" very fast.
> 
> Let me try to list the pros and cons that I can think of:
> Pro:
>     no index modification if key stays same
>     no search for free space for update (if tuple still
>         fits into page)
>     no pg_log
> Con:
>     additional IO to write "before image" to rollback segment
>         (every before image, not only first after checkpoint)
>         (also before image of every index page that is updated !)

I don't think that Oracle writes entire page as before image - just
tuple data and some control info. As for additional IO - we'll do it
anyway to remove "before image" (deleted tuple data) from data files.

Vadim


RE: Plans for solving the VACUUM problem

From
"Mikheev, Vadim"
Date:
> I think so too. I've never said that an overwriting smgr
> is easy, and I don't love it particularly.
> 
> What I'm objecting to is avoiding UNDO without giving up
> an overwriting smgr. We shouldn't be noncommittal now.

Why not? We could decide to do an overwriting smgr later
and implement UNDO then. For the moment we could just
change the checkpointer to use checkpoint.redo instead of
checkpoint.undo when defining what log files should be
deleted - it's a few minutes' deal, and so is changing it
back.

Vadim


Re: Plans for solving the VACUUM problem

From
Hiroshi Inoue
Date:
"Mikheev, Vadim" wrote:
> 
> > I think so too. I've never said that an overwriting smgr
> > is easy, and I don't love it particularly.
> >
> > What I'm objecting to is avoiding UNDO without giving up
> > an overwriting smgr. We shouldn't be noncommittal now.
> 
> Why not? We could decide to do an overwriting smgr later
> and implement UNDO then.

What I'm referring to is the discussion about the handling
of subtransactions in order to introduce the savepoints
functionality. Or do we postpone *savepoints* again?

I realize now that few people have had an idea of how to switch
to an overwriting smgr. I don't think an overwriting smgr
could be achieved at once; we have to prepare for it step by
step.  AFAIK there's no idea how to introduce an overwriting
smgr without UNDO. If we avoid UNDO now, when would an
overwriting smgr appear?  I also think that the problems Andreas
has specified are pretty serious. I have known about the problems
too, and I expected that people had ideas to solve them, but
...  I'm inclined to give up on an overwriting smgr now.

regards,
Hiroshi Inoue


Re: Plans for solving the VACUUM problem

From
Hannu Krosing
Date:
"Mikheev, Vadim" wrote:
> 
> > >> Impractical ? Oracle does it.
> > >
> > >Oracle has MVCC?
> >
> > With restrictions, yes.
> 
> What restrictions? Rollback segments size?
> Non-overwriting smgr can eat all disk space...

Isn't the same true for an overwriting smgr? ;)

> > You didn't know that?  Vadim did ...
> 
> Didn't I mention a few times that I was
> inspired by Oracle? -:)

How does it do MVCC with an overwriting storage manager ?

Could it possibly be a Postgres-inspired bolted-on hack 
needed for better concurrency ?


BTW, are you aware how Interbase does its MVCC - is it more 
like Oracle's way or like PostgreSQL's ?

----------------
Hannu


Re: Plans for solving the VACUUM problem

From
Hannu Krosing
Date:
Don Baccus wrote:
> 
> At 08:15 AM 5/24/01 +0900, Hiroshi Inoue wrote:
> 
> >> Unless we want to abandon MVCC (which I don't), I think an overwriting
> >> smgr is impractical.
> >
> >Impractical ? Oracle does it.
> 
> It's not easy, though ... the current PG scheme has the advantage of being
> relatively simple and probably more efficient than scanning logs like
> Oracle has to do (assuming your datafiles aren't thoroughly clogged with
> old dead tuples).
> 
> Has anyone looked at InterBase for hints for space-reusing strategies?
> 
> As I understand it, they have a tuple-versioning scheme similar to PG's.
> 
> If nothing else, something might be learned as to the efficiency and
> effectiveness of one particular approach to solving the problem.

It may also be beneficial to study SapDB (which is IIRC a branch-off of 
Adabas), although they claim at http://www.sapdb.org/ in the features
section:

NOT supported features:
             Collations
             Result sets that are created within a stored procedure
             and fetched outside.  This feature is planned to be
             offered in one of the coming releases.  Meanwhile, use
             temporary tables.  (See Reference Manual: SAP DB 7.2 and
             7.3 -> Data definition -> CREATE TABLE statement: Owner
             of a table.)
             Multi version concurrency for OLTP.  It is available
             with the object extension of SAPDB only.
             Hot stand by.  This feature is planned to be offered in
             one of the coming releases.

So MVCC seems to be a bolt-on feature there.

---------------------
Hannu


RE: Plans for solving the VACUUM problem

From
"Mikheev, Vadim"
Date:
> > > >Oracle has MVCC?
> > >
> > > With restrictions, yes.
> > 
> > What restrictions? Rollback segments size?
> > Non-overwriting smgr can eat all disk space...
> 
> Isn't the same true for an overwriting smgr? ;)

Removing dead records from rollback segments should
be faster than from datafiles.

> > > You didn't know that?  Vadim did ...
> > 
> > Didn't I mention a few times that I was
> > inspired by Oracle? -:)
> 
> How does it do MVCC with an overwriting storage manager ?

1. System Change Number (SCN) is used: system increments it
   on each transaction commit.
2. When scan meets data block with SCN > SCN as it was when
   query/transaction started, old block image is restored
   using rollback segments.
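
A sketch of that visibility rule, with the whole rollback-segment
walk reduced to one hypothetical call:

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Scn;      /* system change number */

typedef struct Block {
    Scn  scn;              /* SCN of the last change to this block */
    char data[8192];
} Block;

/* Invented: reconstruct the block as of 'as_of' from the rollback
 * segment; fails if the needed version was already recycled
 * ("snapshot too old"). */
extern bool rollback_fetch_version(const Block *cur, Scn as_of, Block *out);

static bool read_block_as_of(const Block *cur, Scn snapshot_scn, Block *out)
{
    if (cur->scn <= snapshot_scn) {
        *out = *cur;       /* unchanged since our snapshot: cheap path */
        return true;
    }
    /* Only transactions older than the block's last change pay the
     * cost of restoring an old image. */
    return rollback_fetch_version(cur, snapshot_scn, out);
}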
 

> Could it possibly be a Postgres-inspired bolted-on hack 
> needed for better concurrency ?

-:)) Oracle has had MVCC for years, probably from the beginning,
and for sure before Postgres.

> BTW, are you aware how Interbase does its MVCC - is it more 
> like Oracle's way or like PostgreSQL's ?

Like ours.

Vadim


Re: Plans for solving the VACUUM problem

From
Philip Warner
Date:
At 01:51 25/05/01 +0500, Hannu Krosing wrote:
>
>How does it do MVCC with an overwriting storage manager ?
>

I don't know about Oracle, but Dec/RDB also does overwriting and MVCC.  It
does this by taking a snapshot of pages that are participating in both RW
and RO transactions (Dec/RDB has options on SET TRANSACTION that specify
whether the TX will do updates or not).  It has the disadvantage that the
snapshot will grow quite large for bulk loads.  Typically they are about
10-20% of DB size.  Pages are freed from the snapshot as active TXs finish.

Note that the snapshots are separate from the journalling (WAL) and
rollback files.




Re: Plans for solving the VACUUM problem

From
Bruce Momjian
Date:
> What I'm referring to is the discussion about the handling
> of subtransactions in order to introduce the savepoints
> functionality. Or do we postpone *savepoints* again?
> 
> I realize now that few people have had an idea of how to switch
> to an overwriting smgr. I don't think an overwriting smgr
> could be achieved at once; we have to prepare for it step by
> step.  AFAIK there's no idea how to introduce an overwriting
> smgr without UNDO. If we avoid UNDO now, when would an
> overwriting smgr appear?  I also think that the problems Andreas
> has specified are pretty serious. I have known about the problems
> too, and I expected that people had ideas to solve them, but
> ...  I'm inclined to give up on an overwriting smgr now.

Now that everyone has commented on the UNDO issue, I thought I would try
to summarize the comments so we can come to some kind of conclusion.

Here are the issues as I see them:

---------------------------------------------------------------------------

Do we want to keep MVCC?

Yes.  No one has said otherwise.

---------------------------------------------------------------------------

Do we want to head for an overwriting storage manager?

Not sure.  

Advantages:  UPDATE has easy space reuse because usually done in-place,
no index change on UPDATE unless key is changed.

Disadvantages:  Old records have to be stored somewhere for MVCC use. 
Could limit transaction size.

---------------------------------------------------------------------------

Do we want UNDO _if_ we are heading for an overwriting storage manager?

Everyone seems to say yes.

---------------------------------------------------------------------------

Do we want UNDO if we are _not_ heading for an overwriting storage
manager?

This is the tough one.  UNDO advantages are:

    Make subtransactions easier by rolling back aborted subtransactions
    (the workaround is using a new transaction id for each subtransaction).
    Easy space reuse for aborted transactions.
    Reduce the size of pg_log.

UNDO disadvantages are:

    Limit size of transactions to log storage size.

---------------------------------------------------------------------------

If we are heading for an overwriting storage manager, we may as well get
UNDO now.  If we are not, then we have to decide if we can solve the
problems that UNDO would fix.  Basically, can we solve those problems
more easily without UNDO, or are the disadvantages of UNDO too great?



RE: Plans for solving the VACUUM problem

From
Lincoln Yeoh
Date:
At 10:00 AM 24-05-2001 -0700, Mikheev, Vadim wrote:
>> >> Impractical ? Oracle does it.
>> >
>> >Oracle has MVCC?
>> 
>> With restrictions, yes.
>
>What restrictions? Rollback segments size?
>Non-overwriting smgr can eat all disk space...
>
>> You didn't know that?  Vadim did ...
>
>Didn't I mention a few times that I was
>inspired by Oracle? -:)

Is there yet another way to do it that could be better?  Has Oracle
actually done it the best way for once? ;)

But as long as Postgresql doesn't get painted into a corner, it doesn't
matter so much to me - I believe you guys can do a good job (as long as
"it's ready when it's ready", not "when Marketing says so"). 

My worry is if suddenly it is better to do savepoints another way, but it
changes _usage_ and thus breaks apps. Or it makes Postgresql look really
ugly. 

Right now Postgresql is fairly neat/clean with only a few exceptions
(BLOBs, VACUUM). Whereas Oracle is full of so many cases where things were
done wrong but had to be kept that way (and erm why VARCHAR2?). And full of
bits slapped on. So I sure hope postgresql doesn't end up like Oracle.
Because if I want a Frankenstein database I'll go for Oracle. Sure it's
powerful and all that, but it's damn ugly...

Take all the time you want to do it right, coz once Postgresql gets really
popular, your hands will be even more tied. When that happens it's better
to be tied to a nice spot eh?

Cheerio,
Link.



Re: Plans for solving the VACUUM problem

From
Hannu Krosing
Date:
"Mikheev, Vadim" wrote:
> 
> > > > >Oracle has MVCC?
> > > >
> > > > With restrictions, yes.
> > >
> > > What restrictions? Rollback segments size?
> > > Non-overwriting smgr can eat all disk space...
> >
> > Isn't the same true for an overwriting smgr? ;)
> 
> Removing dead records from rollback segments should
> be faster than from datafiles.

Is it for better locality or are they stored in a different way ?

Do you think that there is some fundamental performance advantage 
in making a copy to rollback segment and then deleting it from 
there vs. reusing space in datafiles ?

One thing (not having to update non-changing index entries) can be 
quite substantial under some scenarios, but there are probably ways 
to at least speed up part of this by making other compromises, perhaps 
by saving more info in the index leaf (trading lookup speed for space 
and insert speed) or chaining data pages (trading insert speed for 
(some) space and lookup speed).

> > > > You didn't know that?  Vadim did ...
> > >
> > > Didn't I mention a few times that I was
> > > inspired by Oracle? -:)
> >
> > How does it do MVCC with an overwriting storage manager ?
> 
> 1. System Change Number (SCN) is used: system increments it
>    on each transaction commit.
> 2. When scan meets data block with SCN > SCN as it was when
>    query/transaction started, old block image is restored
>    using rollback segments.

You mean it is restored in the session that is running the transaction?

I guess that it could be slower than our current way of doing it.

> > Could it possibly be a Postgres-inspired bolted-on hack
> > needed for better concurrency ?
> 
> -:)) Oracle has had MVCC for years, probably from the beginning,
> and for sure before Postgres.

In that case we can claim that their way is more primitive ;) ;)

-----------------
Hannu


RE: Plans for solving the VACUUM problem

From
"Mikheev, Vadim"
Date:
> Do we want to head for an overwriting storage manager?
> 
> Not sure.  
> 
> Advantages:  UPDATE has easy space reuse because usually done
> in-place, no index change on UPDATE unless key is changed.
> 
> Disadvantages:  Old records have to be stored somewhere for MVCC use. 
> Could limit transaction size.

Really? Why is it assumed that we *must* limit the size of rollback
segments?  We can let them grow without bounds, as we do now, keeping
old records in the datafiles and letting them eat all of the disk space.

> UNDO disadvantages are:
> 
>     Limit size of transactions to log storage size.

Don't be kidding - in any system, transaction size is limited
by available storage. So we should just say that more disk space
is required for UNDO. From my POV, putting down $100 to buy a 30Gb
disk is no big deal, keeping in mind that PGSQL requires
$ZERO to be used.

Vadim


RE: Plans for solving the VACUUM problem

From
"Mikheev, Vadim"
Date:
> > > >Oracle has MVCC?
> > > 
> > > With restrictions, yes.
> > 
> > What restrictions? Rollback segments size?
> 
> No, that is not the whole story. The problem with their
> "rollback segment approach" is, that they do not guard against
> overwriting a tuple version in the rollback segment.
> They simply recycle each segment in a wrap-around manner.
> Thus there could be an open transaction that still wanted to
> see a tuple version that was already overwritten, leading to the
> feared "snapshot too old" error.
> 
> Copying their "rollback segment" approach is imho the last 
> thing we want to do.

So, they limit the size of rollback segments, and we don't limit
how big our datafiles may grow if there is some long-running
transaction in serializable mode.  We could allow our rollback
segments to grow without limits as well.

> > Non-overwriting smgr can eat all disk space...
> > 
> > > You didn't know that?  Vadim did ...
> > 
> > Didn't I mention a few times that I was inspired by Oracle? -:)
> 
> Looking at what they supply in the feature area is imho good.
> Copying their technical architecture is not so good in general.

Copying is not inspiration -:)

Vadim


RE: Plans for solving the VACUUM problem

From
"Mikheev, Vadim"
Date:
> > Removing dead records from rollback segments should
> > be faster than from datafiles.
> 
> Is it for better locality or are they stored in a different way ?

Locality - all dead data would be localized in one place.

> Do you think that there is some fundamental performance advantage
> in making a copy to rollback segment and then deleting it from
> there vs. reusing space in datafiles ?

As WAL has shown, additional writes don't mean worse performance.
As for deleting from the RS (rollback segment) - we could remove or
reuse RS files as a whole.

> > > How does it do MVCC with an overwriting storage manager ?
> > 
> > 1. System Change Number (SCN) is used: system increments it
> >    on each transaction commit.
> > 2. When scan meets data block with SCN > SCN as it was when
> >    query/transaction started, old block image is restored
> >    using rollback segments.
> 
> You mean it is restored in session that is running the transaction ?
> 
> I guess thet it could be slower than our current way of doing it.

Yes, for older transactions which *really* need that *particular*
old data, but not for newer ones. Look - right now transactions have to
read dead data again and again, even if some of them (the newer ones)
need not see those data at all, and we keep dead data around as long as
required for other old transactions *just in case* they will look there.
But who knows?! Maybe those old transactions will not read from the table
with the big amount of dead data at all! So - why keep dead data in the
datafiles for a long time? This obviously affects overall system
performance.

Vadim


RE: Plans for solving the VACUUM problem

From
"Hiroshi Inoue"
Date:
> -----Original Message-----
> From: Mikheev, Vadim [mailto:vmikheev@SECTORBASE.COM]
> 
> > Do we want to head for an overwriting storage manager?
> > 
> > Not sure.  
> > 
> > Advantages:  UPDATE has easy space reuse because usually done
> > in-place, no index change on UPDATE unless key is changed.
> > 
> > Disadvantages:  Old records have to be stored somewhere for MVCC use. 
> > Could limit transaction size.
> 
> Really? Why is it assumed that we *must* limit the size of rollback
> segments?  We can let them grow without bounds, as we do now, keeping
> old records in the datafiles and letting them eat all of the disk space.
> 

Is it proper/safe for a DBMS to let the system eat all the disk
space?  For example, could we expect to recover the database
even when no disk space is available?

1) Even before WAL: is 'deleting records and vacuum' always
   possible?  I saw cases where indexes grow during vacuum.

2) Under WAL (current): if DROP or VACUUM is done after a
   checkpoint, wouldn't REDO recovery add back the pages
   dropped/truncated by the DROP/VACUUM?

3) With rollback data: shouldn't WAL log UNDO operations as
   well?  If so, UNDO requires extra disk space which could
   be unlimitedly big.

There's another serious problem.  Once UNDO is required
with a biiiig amount of rollback data, it would take a veeery long
time to undo.  That's quite different from the current behavior.
Even if people want to cancel the UNDO, there's no way to,
unfortunately (under an overwriting smgr).

regards,
Hiroshi Inoue 


RE: Plans for solving the VACUUM problem

From
Don Baccus
Date:
At 10:00 AM 5/24/01 -0700, Mikheev, Vadim wrote:
>> >> Impractical ? Oracle does it.
>> >
>> >Oracle has MVCC?
>> 
>> With restrictions, yes.
>
>What restrictions? Rollback segments size?
>Non-overwriting smgr can eat all disk space...

Actually, the restriction I'm thinking about isn't MVCC related, per
se, but a within-transaction restriction.  The infamous "mutating table"
error.

>> You didn't know that?  Vadim did ...
>
>Didn't I mention a few times that I was
>inspired by Oracle? -:)

Yes, you most certainly have!





RE: Plans for solving the VACUUM problem

From
"Mikheev, Vadim"
Date:
> > > Seems overwrite smgr has mainly advantages in terms of
> > > speed for operations other than rollback.
> > 
> > ... And rollback is required for < 5% transactions ...
> 
> This obviously depends on application. 

The small number of aborted transactions was used to show the
uselessness of UNDO in terms of space cleanup - that's why
I use the same argument to show the usefulness of the O-smgr -:)

> I know people who rollback most of their transactions
> (actually they use it to emulate temp tables when reporting). 

Shouldn't they use TEMP tables? -:)

> OTOH it is possible to do without rolling back at all as
> MySQL folks have shown us ;)

Not with SDB tables which support transactions.

Vadim


RE: Plans for solving the VACUUM problem

From
Don Baccus
Date:
At 10:49 AM 5/29/01 -0700, Mikheev, Vadim wrote:

>> I know people who rollback most of their transactions
>> (actually they use it to emulate temp tables when reporting). 
>
>Shouldn't they use TEMP tables? -:)

Which is a very good point.  Pandering to poor practice at the
expense of good performance for better-designed applications
isn't a good idea.





Re: Plans for solving the VACUUM problem

From
Hannu Krosing
Date:
"Mikheev, Vadim" wrote:
> 
> > I know people who rollback most of their transactions
> > (actually they use it to emulate temp tables when reporting).
> 
> Shouldn't they use TEMP tables? -:)

They probably should.

Actually they did it on Oracle, so it shows that it can be done 
even with an O-smgr ;)

> > OTOH it is possible to do without rolling back at all as
> > MySQL folks have shown us ;)
> 
> Not with SDB tables which support transactions.

My point was that MySQL was used for quite a long time without it, 
and still quite many useful applications were produced.

BTW, do you know what strategy is used by BSDDB/SDB for 
rollback/undo ?

---------------
Hannu


RE: Plans for solving the VACUUM problem

From
"Mikheev, Vadim"
Date:
> > > OTOH it is possible to do without rolling back at all as
> > > MySQL folks have shown us ;)
> > 
> > Not with SDB tables which support transactions.
> 
> My point was that MySQL was used for quite a long time without it, 
> and still quite many useful applications were produced.

And my point was that it's needless to talk about rollbacks in a
non-transactional system, while in a transactional system one has
to implement rollback somehow.

> BTW, do you know what strategy is used by BSDDB/SDB for 
> rollback/undo ?

AFAIR, they use an O-smgr => UNDO is required.

Vadim