Thread: 'TID index'
I just had a thought that could potentially greatly improve vacuum performance. What about some kind of TID (or does vacuum use CID?) index? This would allow vacuum to visit only the pages it needs to visit. Actually, I guess TID/CID wouldn't even be involved; the only information needed would be whether any tuples on a page have been marked deleted. Something as simple as a bitmap could work. Storing TID info might provide added vacuum efficiency, but my guess is it's probably not worth the extra effort.

This might not help much for tables that just see a lot of random update activity, but I think it would be very useful for large tables where pages with dead tuples are likely to be a small percentage of the total number of pages.

Maintaining this information on a per-transaction basis might prove difficult to do without causing concurrency issues. Luckily, I think this could probably be done in the background without much difficulty. One possibility is to check for dead tuples as pages are written to disk (actually, by definition, there would have to be dead tuples at that point, I would think). If memory serves, writing these pages is now a background process, so this shouldn't cause contention issues.

-- 
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
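To make the bitmap idea concrete, here is a minimal, standalone C sketch of the sort of structure being proposed: one bit per heap page, set when a tuple on that page is marked deleted, cleared once vacuum has cleaned the page. Every name here (DeadPageMap, dpm_set, and so on) is invented for illustration; this is not PostgreSQL code. At one bit per 8KB page, such a map costs about 16KB per gigabyte of heap.

/*
 * Minimal sketch of a per-relation "pages with dead tuples" bitmap.
 * All names (DeadPageMap, dpm_*) are hypothetical, for illustration only.
 */
#include <stdio.h>
#include <stdlib.h>

typedef struct DeadPageMap
{
    size_t         npages;  /* number of heap pages tracked */
    unsigned char *bits;    /* one bit per heap page */
} DeadPageMap;

static DeadPageMap *
dpm_create(size_t npages)
{
    DeadPageMap *map = malloc(sizeof(DeadPageMap));

    map->npages = npages;
    map->bits = calloc((npages + 7) / 8, 1);
    return map;
}

/* Conceptually called when a tuple on "pageno" is marked deleted. */
static void
dpm_set(DeadPageMap *map, size_t pageno)
{
    map->bits[pageno / 8] |= (unsigned char) (1 << (pageno % 8));
}

/* Conceptually called by vacuum once it has cleaned "pageno". */
static void
dpm_clear(DeadPageMap *map, size_t pageno)
{
    map->bits[pageno / 8] &= (unsigned char) ~(1 << (pageno % 8));
}

static int
dpm_is_set(const DeadPageMap *map, size_t pageno)
{
    return (map->bits[pageno / 8] >> (pageno % 8)) & 1;
}

int
main(void)
{
    /* A large table: a million pages, only two of which have dead tuples. */
    DeadPageMap *map = dpm_create(1000000);

    dpm_set(map, 42);
    dpm_set(map, 987654);

    /*
     * A targeted "vacuum" scans the in-memory bitmap (cheap) and reads
     * only the heap pages whose bit is set, instead of the whole table.
     */
    for (size_t p = 0; p < map->npages; p++)
    {
        if (dpm_is_set(map, p))
        {
            printf("vacuum heap page %zu\n", p);
            dpm_clear(map, p);
        }
    }

    free(map->bits);
    free(map);
    return 0;
}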
Jim C. Nasby wrote:
> I just had a thought that could potentially greatly improve vacuum
> performance. What about some kind of TID (or does vacuum use CID?)
> index? This would allow vacuum to visit only the pages it needs to
> visit. Actually, I guess TID/CID wouldn't even be involved; the only
> information needed would be whether any tuples on a page have been marked
> deleted. Something as simple as a bitmap could work. Storing TID info
> might provide added vacuum efficiency, but my guess is it's probably not
> worth the extra effort.
>
> This might not help much for tables that just see a lot of random update
> activity, but I think it would be very useful for large tables where
> pages with dead tuples are likely to be a small percentage of the total
> number of pages.
>
> Maintaining this information on a per-transaction basis might prove
> difficult to do without causing concurrency issues. Luckily, I think
> this could probably be done in the background without much difficulty.
> One possibility is to check for dead tuples as pages are written to disk
> (actually, by definition, there would have to be dead tuples at that
> point, I would think). If memory serves, writing these pages is now a
> background process, so this shouldn't cause contention issues.

There are many good ideas out there, yet it is almost impossible to find somebody else to implement yours!

The acid test is to try and write it...

Overall, I agree VACUUM could do with some tuning - and 8.0 has just that. It needs very careful thought to make sure both concurrency and recoverability considerations are fully met in any solution you come up with.

Best regards,

Simon Riggs
On Wed, Sep 15, 2004 at 10:56:28PM +0100, Simon Riggs wrote:
> There are many good ideas out there, yet it is almost impossible to find
> somebody else to implement yours!
>
> The acid test is to try and write it...
>
> Overall, I agree VACUUM could do with some tuning - and 8.0 has just that.
> It needs very careful thought to make sure both concurrency and
> recoverability considerations are fully met in any solution you come up
> with.

Heh, I wasn't even thinking of implementation yet. :) I fully understand the lack of developers. Unfortunately, I have very little idea of the internals of PGSQL, and I'm decidedly not a C coder. I *might* be able to get something hacked up that stores info in a table (since that would mean all the space management stuff would be handled for me).

If this is a worthwhile idea, can we at least get a TODO? Would it be useful to come up with a high-level design (something I could probably do)?

-- 
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
Jim C. Nasby wrote:
> On Wed, Sep 15, 2004 at 10:56:28PM +0100, Simon Riggs wrote:
> > There are many good ideas out there, yet it is almost impossible to find
> > somebody else to implement yours!
> >
> > The acid test is to try and write it...
> >
> > Overall, I agree VACUUM could do with some tuning - and 8.0 has just that.
> > It needs very careful thought to make sure both concurrency and
> > recoverability considerations are fully met in any solution you come up
> > with.
>
> Before I try and tackle this I wanted to investigate why PostgreSQL's
> MVCC depends on vacuum instead of going with an undo log a la Oracle or
> InnoDB. The only thing I could find is this thread
> (http://tinyurl.com/42opl), which doesn't say all that much.
>
> Has anyone looked at the performance impact of having to vacuum vs a
> method that doesn't require vacuuming? Obviously rollback with an undo
> log is slow, but it seems it's better to optimize for the normal
> situation of committing.
>
> I'm not looking for a holy war here, but I'd like to know if any analysis
> has happened to determine how efficient vacuum can be made and if it's
> as efficient as just deleting an undo log once you no longer need it.

Fair questions. MVCC has been tightly locked into Postgres/SQL for the whole of its history. There is much written on this and you should search some more - references are in the manual.
http://citeseer.ist.psu.edu/cache/papers/cs/4130/http:zSzzSzwuarchive.wustl.eduzSzpackageszSzpostgreszSzpaperszSzERL-M87-06.pdf/stonebraker87design.pdf

Put simply, MVCC requires vacuum - they are two halves of the overall mechanism. An undo log would be an alternative to MVCC and therefore offer an alternative to vacuum. But there are other things to consider - no other RDBMS requires a "vacuum", but all of them have SOME requirement for off-line scanning of the database, whether it's ANALYZE, or REORG, or dbcc (or more than one of those...). Removing vacuum won't remove the other requirements, so there is less to be gained than you might think.

IMHO vacuum could be optimised further in terms of what it does and how it does it, but many of those optimizations wouldn't apply generally - so you'd then either be introducing further tweaking requirements for DBAs (which are frowned upon) or giving yourself the additional task of auto-optimizing their deployment.

In PostgreSQL 8.0, vacuum has been enhanced so that it operates in the background, avoids locks, and doesn't spoil the dbpage cache - on top of being automated. That approach is simple and pragmatic, and that's hard to beat. I think the effort of altering the current storage system is not something you'd get people (round here) to invest their time in - many would probably regard even the debate as wasted effort (and remember that developer time is a most closely guarded resource on the project). PG does allow you to write your own storage manager...

If you're keen to code, there are many TODO items that are "pre-agreed" as being required (from earlier discussions). That's probably the best place to start. Having said that, research into why the code is the way it is can be interesting and very educational - overall, the code really is written fairly optimally from a robustness and performance perspective - you need to be a fair way in to spot opportunities for improvement. Don't be put off - get stuck in.

Best Regards,

Simon Riggs
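For readers new to the internals, the "two halves of the overall mechanism" point can be illustrated with a heavily simplified sketch of MVCC visibility: every tuple version carries the id of the transaction that created it (xmin) and, once deleted or updated, the id of the transaction that deleted it (xmax); readers decide per snapshot which version they see, and superseded versions stay in the heap until vacuum removes them. The C below is only an illustration of that rule - it is not the real HeapTupleSatisfies* code, and commit tracking is faked with a simple horizon.

/*
 * Simplified illustration of MVCC visibility.  Not PostgreSQL code:
 * committed-transaction tracking is faked with a "commit horizon".
 */
#include <stdbool.h>
#include <stdio.h>

typedef unsigned int TxId;

typedef struct Tuple
{
    TxId xmin;   /* transaction that created this version */
    TxId xmax;   /* transaction that deleted it, or 0 if live */
    int  value;
} Tuple;

/* Assume every transaction id below the horizon has committed. */
static bool
committed(TxId xid, TxId commit_horizon)
{
    return xid != 0 && xid < commit_horizon;
}

/*
 * A version is visible to a snapshot taken at "snap_xid" if it was
 * created by a committed transaction before the snapshot and has not
 * been deleted by one.
 */
static bool
visible(const Tuple *t, TxId snap_xid, TxId commit_horizon)
{
    if (!committed(t->xmin, commit_horizon) || t->xmin >= snap_xid)
        return false;
    if (committed(t->xmax, commit_horizon) && t->xmax < snap_xid)
        return false;
    return true;
}

/*
 * A version is dead (vacuumable) once its deleter committed before the
 * oldest snapshot anyone still holds.  Until vacuum runs, it still
 * occupies space in the heap.
 */
static bool
dead(const Tuple *t, TxId oldest_snapshot, TxId commit_horizon)
{
    return committed(t->xmax, commit_horizon) && t->xmax < oldest_snapshot;
}

int
main(void)
{
    /* An UPDATE leaves both versions in the heap: */
    Tuple old_version = {.xmin = 100, .xmax = 200, .value = 1};
    Tuple new_version = {.xmin = 200, .xmax = 0,   .value = 2};
    TxId  commit_horizon = 300;

    printf("old visible to snapshot 150? %d\n",
           visible(&old_version, 150, commit_horizon));   /* yes */
    printf("old visible to snapshot 250? %d\n",
           visible(&old_version, 250, commit_horizon));   /* no  */
    printf("new visible to snapshot 250? %d\n",
           visible(&new_version, 250, commit_horizon));   /* yes */
    printf("old vacuumable once oldest snapshot is 250? %d\n",
           dead(&old_version, 250, commit_horizon));      /* yes */
    return 0;
}

An undo-log design instead overwrites the row in place and keeps the old image in a separate undo structure, which is discarded once no reader needs it - so the cleanup cost moves from a background vacuum to rollback and undo retention.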
On Sat, Sep 25, 2004 at 11:14:53AM +0100, Simon Riggs wrote:
> > Jim C. Nasby
>
> Fair questions. MVCC has been tightly locked into Postgres/SQL for the whole
> of its history. There is much written on this and you should search some
> more - references are in the manual.

Well, not quite its whole history: MVCC showed up in 6.5. Vacuum's been there since before SQL.

Actually, it's a bit of a historical accident. My understanding of the sequence of events is that Hewlett-Packard donated an early WORM optical drive to the Stonebraker lab. Since it's write once, it had the behavior that you could only append to files. Someone thought it might be useful for auditing, etc., so they wrote the first storage manager for postgres to accommodate that drive. The other storage manager at the time was for battery-backed, persistent RAM.

So, all this append-only writing leads to files with lots of dead tuples, so the vacuum command was added to reclaim space. I think on the WORM drive, this was supposed to mark blocks 'invisible' in some sense. I don't know if the WORM drive ever actually got used with postgres.

Ross

-- 
Ross Reedstrom, Ph.D.                                 reedstrm@rice.edu
Research Scientist                                  phone: 713-348-6166
The Connexions Project      http://cnx.rice.edu       fax: 713-348-3665
Rice University MS-375, Houston, TX 77005
GPG Key fingerprint = F023 82C8 9B0E 2CC6 0D8E F888 D3AE 810E 88F0 BEDE

"Ross J. Reedstrom" <reedstrm@rice.edu> writes:
> ... So, all this append-only writing leads to files with lots of dead
> tuples, so the vacuum command was added to reclaim space.

Actually, I believe the original Postgres design envisioned that no tuple would ever be permanently deleted: the idea was that you would always be able to query the database state as of past times as well as the present instant. Stonebraker intended to use the WORM drive as the repository for dead tuples that might be needed to answer such historical queries. The "vacuum cleaner" was originally a background process that pushed dead tuples from magnetic disk to WORM storage. Its current manifestation is dumbed-down from that, since we only delete rows and make no attempt to save them somewhere else --- but it was always an integral part of the system design.

It's quite entertaining to read "The design of the POSTGRES storage system" (ERL-M87-06, available at http://db.cs.berkeley.edu//papers/) and compare it to where we are now. There is just enough similarity that it's obviously the ancestor of our present code ... but there is also a lot in that paper that has left *no* trace in our present code. I would love to know just how much of the paper actually got implemented and then discarded, and how much never got beyond the arm-waving stage.

			regards, tom lane
On Sun, 2004-09-26 at 09:17, Tom Lane wrote:
> "Ross J. Reedstrom" <reedstrm@rice.edu> writes:
> > ... So, all this append-only writing leads to files with lots of dead
> > tuples, so the vacuum command was added to reclaim space.
>
> Actually, I believe the original Postgres design envisioned that no
> tuple would ever be permanently deleted: the idea was that you would
> always be able to query the database state as of past times as well
> as the present instant. Stonebraker intended to use the WORM drive as
> the repository for dead tuples that might be needed to answer such
> historical queries. The "vacuum cleaner" was originally a background
> process that pushed dead tuples from magnetic disk to WORM storage.

We are now getting back to the point where the "background process" part is true again - how hard would it be to modify vacuum to write reclaimed tuples to somewhere (network pipe, WORM drive, other DB)?

It seems that in addition to writing deleted tuples out to a history DB and making create and delete transaction ids explicitly visible (and doing something(tm) about the transaction id wraparound while at it), the only thing left to do is some kind of transaction time log - and voila! we have the original Time Travel feature back - a great feature for resolving both "audit trail" and "clumsy dba" problems.

The modern WORM drive equivalent is an IDE(-RAID) disk with its very tape-like access profile (3+ hours to write a full 300GB disk, random access order(s) of magnitude slower than sequential).

So if someone is looking for a project, this seems to be something that is both theoretically possible and also theoretically useful ;)

----------------
Hannu
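A purely hypothetical sketch of the shape of that idea (there is no such hook in PostgreSQL, and every name below is invented): when vacuum finds a dead version, it would append the tuple together with its creating and deleting transaction ids to an append-only history store - a file here, but a pipe, WORM device, or another database would fit the same interface. A real design would also need the commit timestamps of those transactions so that "as of" queries could be answered later.

/*
 * Hypothetical illustration only: PostgreSQL has no such hook.  The idea
 * is that vacuum, on finding a dead tuple version, could append it to an
 * append-only archive together with its creating/deleting transaction
 * ids, instead of simply discarding it.
 */
#include <stdio.h>
#include <string.h>

typedef unsigned int TxId;

typedef struct ArchivedVersion
{
    TxId xmin;          /* transaction that created the version */
    TxId xmax;          /* transaction that deleted it */
    char payload[64];   /* stand-in for the tuple contents */
} ArchivedVersion;

/*
 * Append one reclaimed version to the history store.  A real design
 * would also record commit timestamps for the two transactions.
 */
static int
archive_dead_version(FILE *archive, const ArchivedVersion *v)
{
    return fwrite(v, sizeof(*v), 1, archive) == 1 ? 0 : -1;
}

int
main(void)
{
    FILE *archive = fopen("history.dat", "ab");   /* append-only, WORM-like */

    if (archive == NULL)
        return 1;

    ArchivedVersion v = {.xmin = 100, .xmax = 200};
    strncpy(v.payload, "old row contents", sizeof(v.payload) - 1);

    /*
     * In the sketch, vacuum would call this for every dead tuple it
     * removes, before reusing the space in the heap page.
     */
    archive_dead_version(archive, &v);

    fclose(archive);
    return 0;
}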
Hannu Krosing [mailto:hannu@tm.ee] wrote:
> On Sun, 2004-09-26 at 09:17, Tom Lane wrote:
> > "Ross J. Reedstrom" <reedstrm@rice.edu> writes:
> > > ... So, all this append-only writing leads to files with lots of dead
> > > tuples, so the vacuum command was added to reclaim space.
> >
> > Actually, I believe the original Postgres design envisioned that no
> > tuple would ever be permanently deleted: the idea was that you would
> > always be able to query the database state as of past times as well
> > as the present instant. Stonebraker intended to use the WORM drive as
> > the repository for dead tuples that might be needed to answer such
> > historical queries. The "vacuum cleaner" was originally a background
> > process that pushed dead tuples from magnetic disk to WORM storage.
>
> We are now getting back to the point where the "background process" part
> is true again - how hard would it be to modify vacuum to write reclaimed
> tuples to somewhere (network pipe, WORM drive, other DB)?
>
> It seems that in addition to writing deleted tuples out to a history DB
> and making create and delete transaction ids explicitly visible (and
> doing something(tm) about the transaction id wraparound while at it), the
> only thing left to do is some kind of transaction time log - and voila!
> we have the original Time Travel feature back - a great feature for
> resolving both "audit trail" and "clumsy dba" problems.
>
> The modern WORM drive equivalent is an IDE(-RAID) disk with its very
> tape-like access profile (3+ hours to write a full 300GB disk, random
> access order(s) of magnitude slower than sequential).
>
> So if someone is looking for a project, this seems to be something that
> is both theoretically possible and also theoretically useful ;)

Yes, I thought that too - but not for me, not now.

Look here: http://research.microsoft.com/db/ImmortalDB/ for something similar.

Best Regards,

Simon Riggs
To answer Simon's earlier comment about whether I was looking to start hacking on PostgreSQL... I'm not. :) I might take a look at the TODO again, but I seem to do a great job of finding things to put on my plate as it is. I am interested in minimizing the impact of vacuum, which is why I brought my idea up originally. So, if anything, I'll be much more interested in working on improving vacuum than on something else.

Heh, I was about to ask for the list of pages with dead tuples to be added to the TODO, but it seems it's already there ('Maintain a map of recently-expired rows'). One thing that isn't there which I remember being discussed was having the page-write daemon do a vacuum of a page before it's written; has this been done already?

-- 
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
On Thu, Sep 30, 2004 at 04:57:37PM -0500, Jim C. Nasby wrote:
> Heh, I was about to ask for the list of pages with dead tuples to be added
> to the TODO, but it seems it's already there ('Maintain a map of
> recently-expired rows'). One thing that isn't there which I remember
> being discussed was having the page-write daemon do a vacuum of a page
> before it's written; has this been done already?

Not done, but discussed. One problem is that to vacuum a page, you may need to bring the related index pages into memory, which is more work than you are saving.

One idea would be to vacuum the page if it can be determined that the relation doesn't have indexes. This information is generally not known, because the index list isn't constructed until/unless someone explicitly asks for it. But anyway, do you have a lot of tables with no index? And how many of them are big enough to warrant doing this?

-- 
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"Having your biases confirmed independently is how scientific progress is
made, and hence made our great society what it is today" (Mary Gardiner)
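To illustrate the trade-off Alvaro describes rather than propose real code (the names below are invented; they are not bgwriter or relcache APIs): pruning a heap page at write time is only self-contained when the relation has no indexes, because removing a heap tuple otherwise leaves index entries pointing at nothing, and cleaning those up means reading index pages you were not going to touch.

/*
 * Invented names throughout -- a sketch of the write-time-prune decision,
 * not bgwriter or relcache code.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct PageInfo
{
    int  dead_tuples;      /* dead tuples currently on the page */
    bool rel_has_indexes;  /* does the owning relation have any indexes? */
    bool index_list_known; /* has the index list even been built yet? */
} PageInfo;

/*
 * Decide whether it is safe and cheap to prune dead tuples from a heap
 * page just before it is written out.  If the relation has indexes,
 * removing heap tuples here would orphan index entries, so the index
 * pages would have to be read and fixed as well -- usually more work
 * than the write-time prune saves.
 */
static bool
prune_before_write(const PageInfo *page)
{
    if (page->dead_tuples == 0)
        return false;           /* nothing to do */
    if (!page->index_list_known)
        return false;           /* can't tell; play it safe */
    if (page->rel_has_indexes)
        return false;           /* would orphan index entries */
    return true;                /* heap-only relation: prune stays local */
}

int
main(void)
{
    PageInfo log_table  = {.dead_tuples = 5, .rel_has_indexes = false,
                           .index_list_known = true};
    PageInfo user_table = {.dead_tuples = 5, .rel_has_indexes = true,
                           .index_list_known = true};

    printf("prune log table page?  %d\n", prune_before_write(&log_table));
    printf("prune user table page? %d\n", prune_before_write(&user_table));
    return 0;
}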
Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
> One idea would be to vacuum the page if it can be determined that the
> relation doesn't have indexes. This information is generally not known,

... especially not by the page writer. You can't assume that you have access to the relation descriptor --- for instance, it's entirely possible that the bgwriter (or another backend) will need to push out a dirty page that belongs to a newly created relation for which the catalog data remains uncommitted. There are related scenarios involving uncommitted drops.

More generally I think that invoking VACUUM processing from the bgwriter would be a serious violation of the module hierarchy, and would inflict more pain in the form of bugs and maintenance headaches than it could possibly be worth. We just this version managed to get smgr decoupled from the relcache, like it should have been all along. (bufmgr should be too, but I haven't tackled that yet...) This was actually a necessary step to make the separate bgwriter feasible. Let's not reverse that cleanup in pursuit of dubious optimizations.

			regards, tom lane
On Thu, Sep 30, 2004 at 07:11:29PM -0400, Tom Lane wrote:
> More generally I think that invoking VACUUM processing from the bgwriter
> would be a serious violation of the module hierarchy, and would inflict
> more pain in the form of bugs and maintenance headaches than it could
> possibly be worth. We just this version managed to get smgr decoupled
> from the relcache, like it should have been all along. (bufmgr should
> be too, but I haven't tackled that yet...) This was actually a
> necessary step to make the separate bgwriter feasible. Let's not
> reverse that cleanup in pursuit of dubious optimizations.

Yeah, I thought about the same thing. It would certainly be more modular to have a vacuum daemon that runs ahead of the page writer.

As for the indexes, would it be reasonable to see if the required index pages were already in memory? Ultimately, it's going to depend on the table and access patterns as to whether this provides a speed improvement. A table with a lot of indexes and a lot of updates per page might not see much benefit. A table where updates are spread across the table would more likely benefit even if it does have to bring index pages into memory. The alternative is the vacuum running later, having to bring in the base data page, and then all the indexes anyway. But once there is a list of pages with dead tuples, only one read and one write would be saved, which probably isn't worth the extra code.

-- 
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
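A rough back-of-envelope version of that last trade-off may make it clearer; the figures are illustrative assumptions, not measurements of any real workload. Once a map of pages with dead tuples exists, a later targeted vacuum touches the same index pages a write-time prune would, so pruning at write time saves roughly one heap-page read and one heap-page write per dirty page.

/* Illustrative page-I/O arithmetic only; all numbers are assumptions. */
#include <stdio.h>

int
main(void)
{
    long table_pages        = 1000000;  /* heap pages in the table */
    long dirty_pages        = 1000;     /* pages with dead tuples */
    long index_pages_per_pg = 3;        /* index pages touched per dirty heap page */

    /* (a) Plain vacuum: scan every heap page, plus index cleanup. */
    long plain_vacuum = table_pages + dirty_pages * index_pages_per_pg;

    /* (b) Targeted vacuum driven by a dead-page map: read and rewrite
     *     only the dirty heap pages, plus the same index cleanup. */
    long targeted_vacuum = 2 * dirty_pages + dirty_pages * index_pages_per_pg;

    /* (c) Prune at write time: the heap page is already in memory and
     *     being written anyway, so only the index pages are extra. */
    long write_time_prune = dirty_pages * index_pages_per_pg;

    printf("plain vacuum     : %ld page I/Os\n", plain_vacuum);
    printf("targeted vacuum  : %ld page I/Os\n", targeted_vacuum);
    printf("write-time prune : %ld page I/Os\n", write_time_prune);
    printf("saving of (c) over (b): %ld (one read + one write per dirty page)\n",
           targeted_vacuum - write_time_prune);
    return 0;
}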