Thread: Resurrecting per-page cleaner for btree

Resurrecting per-page cleaner for btree

From: ITAGAKI Takahiro
This is a revised patch originated by Junji TERAMOTO for HEAD.
  [BTree vacuum before page splitting]
  http://archives.postgresql.org/pgsql-patches/2006-01/msg00301.php

I think we can resurrect his idea because we now scan btree pages a page
at a time; the missing-restarting-point problem went away.

Have I missed something? Comments welcome.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center



Re: Resurrecting per-page cleaner for btree

From: Simon Riggs
On Thu, 2006-07-13 at 10:49 +0900, ITAGAKI Takahiro wrote:
> This is a revised patch originated by Junji TERAMOTO for HEAD.
>   [BTree vacuum before page splitting]
>   http://archives.postgresql.org/pgsql-patches/2006-01/msg00301.php
>
> I think we can resurrect his idea because we now scan btree pages a page
> at a time; the missing-restarting-point problem went away.
>
> Have I missed something? Comments welcome.

The new locking system does allow this now.

So this is an option for us, AFAICS.

--
  Simon Riggs
  EnterpriseDB          http://www.enterprisedb.com


Re: Resurrecting per-page cleaner for btree

From: Tom Lane
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> This is a revised patch originated by Junji TERAMOTO for HEAD.
>   [BTree vacuum before page splitting]
>   http://archives.postgresql.org/pgsql-patches/2006-01/msg00301.php
> I think we can resurrect his idea because we now scan btree pages a page
> at a time; the missing-restarting-point problem went away.
> Have I missed something? Comments welcome.

I think the only serious objection to this would be that it'd mean that
tuples that should have an index entry might not have one.  The current
form of VACUUM does not care, but people keep raising the idea of doing
"retail" vacuuming that operates by looking up index entries explicitly.
You could certainly make a retail vacuumer do nothing if it fails to
find the expected index entry, but ISTM that'd be a rather serious loss
of consistency checking --- you could not tell the someone-already-
deleted-it case apart from a bug in the vacuumer's index value
computation or lookup.

Personally I don't think retail vacuuming in that form will ever fly
anyway, so I have no problem with installing the proposed patch,
but I thought I'd better throw this comment out to see if anyone
thinks it's a big deal.

            regards, tom lane

Re: [HACKERS] Resurrecting per-page cleaner for btree

From: Heikki Linnakangas
On Mon, 24 Jul 2006, Tom Lane wrote:

> Personally I don't think retail vacuuming in that form will ever fly
> anyway, so I have no problem with installing the proposed patch,
> but I thought I'd better throw this comment out to see if anyone
> thinks it's a big deal.

My feeling is that retail vacuuming would be useful some day. But it's
certainly not going to be there in 8.2, so I have no objection to the
patch. It's a fairly localized change; it can easily be reverted later if
necessary.

- Heikki

Re: Resurrecting per-page cleaner for btree

From: Bruce Momjian
Tom Lane wrote:
> ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> > This is a revised patch originated by Junji TERAMOTO for HEAD.
> >   [BTree vacuum before page splitting]
> >   http://archives.postgresql.org/pgsql-patches/2006-01/msg00301.php
> > I think we can resurrect his idea because we now scan btree pages a page
> > at a time; the missing-restarting-point problem went away.
> > Have I missed something? Comments welcome.
>
> I think the only serious objection to this would be that it'd mean that
> tuples that should have an index entry might not have one.  The current
> form of VACUUM does not care, but people keep raising the idea of doing
> "retail" vacuuming that operates by looking up index entries explicitly.
> You could certainly make a retail vacuumer do nothing if it fails to
> find the expected index entry, but ISTM that'd be a rather serious loss
> of consistency checking --- you could not tell the someone-already-
> deleted-it case apart from a bug in the vacuumer's index value
> computation or lookup.
>
> Personally I don't think retail vacuuming in that form will ever fly
> anyway, so I have no problem with installing the proposed patch,
> but I thought I'd better throw this comment out to see if anyone
> thinks it's a big deal.

Agreed.  Reverse lookup of index entries will always be too slow.

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: Resurrecting per-page cleaner for btree

From: Bruce Momjian
Your patch has been added to the PostgreSQL unapplied patches list at:

    http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---------------------------------------------------------------------------


ITAGAKI Takahiro wrote:
> This is a revised patch originated by Junji TERAMOTO for HEAD.
>   [BTree vacuum before page splitting]
>   http://archives.postgresql.org/pgsql-patches/2006-01/msg00301.php
>
> I think we can resurrect his idea because we now scan btree pages a page
> at a time; the missing-restarting-point problem went away.
>
> Have I missed something? Comments welcome.
>
> Regards,
> ---
> ITAGAKI Takahiro
> NTT Open Source Software Center
>

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: Resurrecting per-page cleaner for btree

From: Tom Lane
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> I think we can resurrect his idea because we now scan btree pages a page
> at a time; the missing-restarting-point problem went away.

> Have I missed something? Comments welcome.

I was thinking for awhile just now that this would break the interlock
that guarantees VACUUM can't delete a heap tuple that an indexscanning
process is about to visit.  After further thought, it doesn't, but it's
non-obvious.  I've added the attached commentary to nbtree/README:


On-the-fly deletion of index tuples
-----------------------------------

If a process visits a heap tuple and finds that it's dead and removable
(ie, dead to all open transactions, not only that process), then we can
return to the index and mark the corresponding index entry "known dead",
allowing subsequent index scans to skip visiting the heap tuple.  The
"known dead" marking uses the LP_DELETE bit in ItemIds.  This is currently
only done in plain indexscans, not bitmap scans, because only plain scans
visit the heap and index "in sync" and so there's not a convenient way
to do it for bitmap scans.

Once an index tuple has been marked LP_DELETE it can actually be removed
from the index immediately; since index scans only stop "between" pages,
no scan can lose its place from such a deletion.  We separate the steps
because we allow LP_DELETE to be set with only a share lock (it's exactly
like a hint bit for a heap tuple), but physically removing tuples requires
exclusive lock.  In the current code we try to remove LP_DELETE tuples when
we are otherwise faced with having to split a page to do an insertion (and
hence have exclusive lock on it already).

This leaves the index in a state where it has no entry for a dead tuple
that still exists in the heap.  This is not a problem for the current
implementation of VACUUM, but it could be a problem for anything that
explicitly tries to find index entries for dead tuples.  (However, the
same situation is created by REINDEX, since it doesn't enter dead
tuples into the index.)

It's sufficient to have an exclusive lock on the index page, not a
super-exclusive lock, to do deletion of LP_DELETE items.  It might seem
that this breaks the interlock between VACUUM and indexscans, but that is
not so: as long as an indexscanning process has a pin on the page where
the index item used to be, VACUUM cannot complete its btbulkdelete scan
and so cannot remove the heap tuple.  This is another reason why
btbulkdelete has to get super-exclusive lock on every leaf page, not only
the ones where it actually sees items to delete.


            regards, tom lane

Re: Resurrecting per-page cleaner for btree

From: Tom Lane
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> This is a revised patch originated by Junji TERAMOTO for HEAD.
>   [BTree vacuum before page splitting]
>   http://archives.postgresql.org/pgsql-patches/2006-01/msg00301.php
> I think we can resurrect his idea because we now scan btree pages a page
> at a time; the missing-restarting-point problem went away.

I've applied this but I'm now having some second thoughts about it,
because I'm seeing an actual *decrease* in pgbench numbers from the
immediately prior CVS HEAD code.  Using
    pgbench -i -s 10 bench
    pgbench -c 10 -t 1000 bench    (repeat this half a dozen times)
with fsync off but all other settings factory-stock, what I'm seeing
is that the first run looks really good but subsequent runs tail off in
spectacular fashion :-(  Pre-patch there was only minor degradation in
successive runs.

What I think is happening is that because pgbench depends so heavily on
updating existing records, we get into a state where an index page is
about full and there's one dead tuple on it, and then for each insertion
we have

    * check for uniqueness marks one more tuple dead (the
      next-to-last version of the tuple)
    * newly added code removes one tuple and does a write
    * now there's enough room to insert one tuple
    * lather, rinse, repeat, never splitting the page.

The problem is that we've traded splitting a page every few hundred
inserts for doing a PageIndexMultiDelete, and emitting an extra WAL
record, on *every* insert.  This is not good.

Had you done any performance testing on this patch, and if so what
tests did you use?  I'm a bit hesitant to try to fix it on the basis
of pgbench results alone.

One possible fix that comes to mind is to only perform the cleanup
if we are able to remove more than one dead tuple (perhaps about 10
would be good).  Or do the deletion anyway, but then go ahead and
split the page unless X amount of space has been freed (where X is
more than just barely enough for the incoming tuple).

After all the thought we've put into this, it seems a shame to
just abandon it :-(.  But it definitely needs more tweaking.

            regards, tom lane

Re: [HACKERS] Resurrecting per-page cleaner for btree

From: Bruce Momjian
The attached patch requires the new row to fit, and 10% to be free on
the page.  Would someone test that?

---------------------------------------------------------------------------

Tom Lane wrote:
> ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> > This is a revised patch originated by Junji TERAMOTO for HEAD.
> >   [BTree vacuum before page splitting]
> >   http://archives.postgresql.org/pgsql-patches/2006-01/msg00301.php
> > I think we can resurrect his idea because we now scan btree pages a page
> > at a time; the missing-restarting-point problem went away.
>
> I've applied this but I'm now having some second thoughts about it,
> because I'm seeing an actual *decrease* in pgbench numbers from the
> immediately prior CVS HEAD code.  Using
>     pgbench -i -s 10 bench
>     pgbench -c 10 -t 1000 bench    (repeat this half a dozen times)
> with fsync off but all other settings factory-stock, what I'm seeing
> is that the first run looks really good but subsequent runs tail off in
> spectacular fashion :-(  Pre-patch there was only minor degradation in
> successive runs.
>
> What I think is happening is that because pgbench depends so heavily on
> updating existing records, we get into a state where an index page is
> about full and there's one dead tuple on it, and then for each insertion
> we have
>
>     * check for uniqueness marks one more tuple dead (the
>       next-to-last version of the tuple)
>     * newly added code removes one tuple and does a write
>     * now there's enough room to insert one tuple
>     * lather, rinse, repeat, never splitting the page.
>
> The problem is that we've traded splitting a page every few hundred
> inserts for doing a PageIndexMultiDelete, and emitting an extra WAL
> record, on *every* insert.  This is not good.
>
> Had you done any performance testing on this patch, and if so what
> tests did you use?  I'm a bit hesitant to try to fix it on the basis
> of pgbench results alone.
>
> One possible fix that comes to mind is to only perform the cleanup
> if we are able to remove more than one dead tuple (perhaps about 10
> would be good).  Or do the deletion anyway, but then go ahead and
> split the page unless X amount of space has been freed (where X is
> more than just barely enough for the incoming tuple).
>
> After all the thought we've put into this, it seems a shame to
> just abandon it :-(.  But it definitely needs more tweaking.
>
>             regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
>                http://archives.postgresql.org

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Index: src/backend/access/nbtree/nbtinsert.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/access/nbtree/nbtinsert.c,v
retrieving revision 1.142
diff -c -c -r1.142 nbtinsert.c
*** src/backend/access/nbtree/nbtinsert.c    25 Jul 2006 19:13:00 -0000    1.142
--- src/backend/access/nbtree/nbtinsert.c    26 Jul 2006 01:35:52 -0000
***************
*** 438,445 ****
              if (P_ISLEAF(lpageop) && P_HAS_GARBAGE(lpageop))
              {
                  _bt_vacuum_one_page(rel, buf);
!                 if (PageGetFreeSpace(page) >= itemsz)
!                     break;        /* OK, now we have enough space */
              }

              /*
--- 438,451 ----
              if (P_ISLEAF(lpageop) && P_HAS_GARBAGE(lpageop))
              {
                  _bt_vacuum_one_page(rel, buf);
!                 /*
!                  *    Free space should be large enough for the new tuple and
!                  *    should be >= 10% because scanning the page over and
!                  *    over again to get just a little free space is inefficient.
!                  */
!                 if (PageGetFreeSpace(page) >= itemsz &&
!                     PageGetFreeSpace(page) >= BLCKSZ / 10)
!                     break;
              }

              /*

Re: [HACKERS] Resurrecting per-page cleaner for btree

From: Bruce Momjian
Tom, I ran your tests with fsync off (as you did), and saw numbers
bouncing between 400-700 tps without my patch, and sticking at 700 tps
with my patch.

---------------------------------------------------------------------------

Bruce Momjian wrote:
>
> The attached patch requires the new row to fit, and 10% to be free on
> the page.  Would someone test that?
>
> ---------------------------------------------------------------------------
>
> Tom Lane wrote:
> > ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> > > This is a revised patch originated by Junji TERAMOTO for HEAD.
> > >   [BTree vacuum before page splitting]
> > >   http://archives.postgresql.org/pgsql-patches/2006-01/msg00301.php
> > > I think we can resurrect his idea because we now scan btree pages a page
> > > at a time; the missing-restarting-point problem went away.
> >
> > I've applied this but I'm now having some second thoughts about it,
> > because I'm seeing an actual *decrease* in pgbench numbers from the
> > immediately prior CVS HEAD code.  Using
> >     pgbench -i -s 10 bench
> >     pgbench -c 10 -t 1000 bench    (repeat this half a dozen times)
> > with fsync off but all other settings factory-stock, what I'm seeing
> > is that the first run looks really good but subsequent runs tail off in
> > spectacular fashion :-(  Pre-patch there was only minor degradation in
> > successive runs.
> >
> > What I think is happening is that because pgbench depends so heavily on
> > updating existing records, we get into a state where an index page is
> > about full and there's one dead tuple on it, and then for each insertion
> > we have
> >
> >     * check for uniqueness marks one more tuple dead (the
> >       next-to-last version of the tuple)
> >     * newly added code removes one tuple and does a write
> >     * now there's enough room to insert one tuple
> >     * lather, rinse, repeat, never splitting the page.
> >
> > The problem is that we've traded splitting a page every few hundred
> > inserts for doing a PageIndexMultiDelete, and emitting an extra WAL
> > record, on *every* insert.  This is not good.
> >
> > Had you done any performance testing on this patch, and if so what
> > tests did you use?  I'm a bit hesitant to try to fix it on the basis
> > of pgbench results alone.
> >
> > One possible fix that comes to mind is to only perform the cleanup
> > if we are able to remove more than one dead tuple (perhaps about 10
> > would be good).  Or do the deletion anyway, but then go ahead and
> > split the page unless X amount of space has been freed (where X is
> > more than just barely enough for the incoming tuple).
> >
> > After all the thought we've put into this, it seems a shame to
> > just abandon it :-(.  But it definitely needs more tweaking.
> >
> >             regards, tom lane
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 4: Have you searched our list archives?
> >
> >                http://archives.postgresql.org
>
> --
>   Bruce Momjian   bruce@momjian.us
>   EnterpriseDB    http://www.enterprisedb.com
>
>   + If your life is a hard drive, Christ can be your backup. +


>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: Resurrecting per-page cleaner for btree

From: Gregory Stark
Tom Lane <tgl@sss.pgh.pa.us> writes:

> ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> > This is a revised patch originated by Junji TERAMOTO for HEAD.
> >   [BTree vacuum before page splitting]
> >   http://archives.postgresql.org/pgsql-patches/2006-01/msg00301.php
> > I think we can resurrect his idea because we now scan btree pages a page
> > at a time; the missing-restarting-point problem went away.
> > Have I missed something? Comments welcome.
>
> I think the only serious objection to this would be that it'd mean that
> tuples that should have an index entry might not have one.  The current
> form of VACUUM does not care, but people keep raising the idea of doing
> "retail" vacuuming that operates by looking up index entries explicitly.
> You could certainly make a retail vacuumer do nothing if it fails to
> find the expected index entry, but ISTM that'd be a rather serious loss
> of consistency checking --- you could not tell the someone-already-
> deleted-it case apart from a bug in the vacuumer's index value
> computation or lookup.

Well, you already have that case anyway due to online index builds (if a
vacuum gets in between the two transactions). I suppose you could go and check
whether the pg_index entry for the index indicates that it's valid, but
that's not something anyone has to look for currently.

> Personally I don't think retail vacuuming in that form will ever fly
> anyway, so I have no problem with installing the proposed patch,
> but I thought I'd better throw this comment out to see if anyone
> thinks it's a big deal.

Well, it's not like the existing vacuum checks for this. It's not even
convenient to check for in the current vacuum architecture: it would have
to maintain the list of expired heap TIDs not yet found during the
bulkdelete and see whether any are left when it's done.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Re: [HACKERS] Resurrecting per-page cleaner for btree

From: Tom Lane
Gregory Stark <gsstark@mit.edu> writes:
> ... Well it's not like the existing vacuum checks for this.

Right, that's exactly why the patch works at all.  But the point here is
that the existing vacuum does not rely on re-computing index keys; all
it cares about is matching TIDs.  The retail-vacuum idea depends on the
assumption that you can look at the tuple and re-compute the same index
keys that you computed the first time; which is an assumption much
shakier than the assumption that TID comparison works.  (In fact, it's
trivial to see how user-defined functions that are mislabeled immutable
could make this fail.)  So retail vacuum without any cross-check that
you got all the index tuples is a scary proposition IMHO.

            regards, tom lane

Re: [HACKERS] Resurrecting per-page cleaner for btree

From: Csaba Nagy
> [snip] (In fact, it's
> trivial to see how user-defined functions that are mislabeled immutable
> could make this fail.)  So retail vacuum without any cross-check that
> you got all the index tuples is a scary proposition IMHO.

Wouldn't it work to restrict that kind of vacuum to tables that have
no indexes using user-defined functions? That would be a very small
restriction, I guess; probably 99.9% of indexes don't use user-defined
functions...

I actually wonder whether such a vacuum would be useful for my scenario,
where I have some pretty big tables and update a relatively small
percentage of them. Would such a vacuum be faster than the current one?
One example is a ~100 million row table where I have 1-4 million
updates per day. Could I run vacuum multiple times a day for this table
and expect individual runs to be relatively fast?

Cheers,
Csaba.



Re: [HACKERS] Resurrecting per-page cleaner for btree

From: Tom Lane
Csaba Nagy <nagy@ecircle-ag.com> writes:
>> [snip] (In fact, it's
>> trivial to see how user-defined functions that are mislabeled immutable
>> could make this fail.)  So retail vacuum without any cross-check that
>> you got all the index tuples is a scary proposition IMHO.

> Wouldn't it work to restrict that kind of vacuum to tables that have
> no indexes using user-defined functions?

Of course, we never have bugs in PG core.  Nope, doesn't happen ...

> I actually wonder whether such a vacuum would be useful for my scenario,
> where I have some pretty big tables and update a relatively small
> percentage of them. Would such a vacuum be faster than the current one?

So far, the case hasn't been made for retail vacuum even ignoring the
not-so-immutable-function risk.

            regards, tom lane

Re: [HACKERS] Resurrecting per-page cleaner for btree

From: Greg Stark
Tom Lane <tgl@sss.pgh.pa.us> writes:

> So far, the case hasn't been made for retail vacuum even ignoring the
> not-so-immutable-function risk.

Well, the desire for it comes from a very well established need for dealing
with extremely large tables with relatively small hot spots. The basic problem
is that currently the cost of vacuum is proportional to the size of the
table rather than to the amount of dead space. There's no link between those
variables (at least in one direction), and any time they're far out of whack
it means excruciating pain for the DBA.

The case that perhaps hasn't been made is whether retail vacuum is the
best solution. The only other candidate I've seen proposed (many, many
times) is some form of segregating the older tuples a la Oracle.

That doesn't mean retail vacuuming is a good idea. It may just be that we
haven't thought of the best option yet. But it may also be that retail
vacuuming, or some kind of rollback segment, is just the least bad idea.

--
greg

Re: [HACKERS] Resurrecting per-page cleaner for btree

From: Martijn van Oosterhout
On Wed, Jul 26, 2006 at 12:47:57PM -0400, Greg Stark wrote:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>
> > So far, the case hasn't been made for retail vacuum even ignoring the
> > not-so-immutable-function risk.
>
> Well, the desire for it comes from a very well established need for dealing
> with extremely large tables with relatively small hot spots. The basic problem
> is that currently the cost of vacuum is proportional to the size of the
> table rather than to the amount of dead space. There's no link between those
> variables (at least in one direction), and any time they're far out of whack
> it means excruciating pain for the DBA.

I thought the suggested solution for that was the dead space map. That
way vacuum can ignore parts of the table that haven't changed...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.

Re: [HACKERS] Resurrecting per-page cleaner for btree

From: Hannu Krosing
On Wed, 2006-07-26 at 23:02, Martijn van Oosterhout wrote:
> On Wed, Jul 26, 2006 at 12:47:57PM -0400, Greg Stark wrote:
> > Tom Lane <tgl@sss.pgh.pa.us> writes:
> >
> > > So far, the case hasn't been made for retail vacuum even ignoring the
> > > not-so-immutable-function risk.
> >
> > Well, the desire for it comes from a very well established need for dealing
> > with extremely large tables with relatively small hot spots. The basic problem
> > is that currently the cost of vacuum is proportional to the size of the
> > table rather than to the amount of dead space. There's no link between those
> > variables (at least in one direction), and any time they're far out of whack
> > it means excruciating pain for the DBA.
>
> I thought the suggested solution for that was the dead space map. That
> way vacuum can ignore parts of the table that haven't changed...

It can ignore parts of the *table* but still has to scan full *indexes*.

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com


Re: Resurrecting per-page cleaner for btree

From: ITAGAKI Takahiro
Bruce Momjian <bruce@momjian.us> wrote:

> The attached patch requires the new row to fit, and 10% to be free on
> the page.  Would someone test that?


This is another solution for the same purpose. We avoid calling
PageIndexMultiDelete() to remove only one tuple, and _bt_split() now
ignores tuples marked LP_DELETE, because such tuples can now be passed
to it. The cleaner threshold (currently 2) might need further tuning.

Can this change resolve your performance problem?


diff -cpr pgsql-orig/src/backend/access/nbtree/nbtinsert.c pgsql/src/backend/access/nbtree/nbtinsert.c
*** pgsql-orig/src/backend/access/nbtree/nbtinsert.c    Wed Jul 26 11:15:20 2006
--- pgsql/src/backend/access/nbtree/nbtinsert.c    Thu Jul 27 18:27:25 2006
*************** _bt_split(Relation rel, Buffer buf, Offs
*** 788,793 ****
--- 788,798 ----
      for (i = P_FIRSTDATAKEY(oopaque); i <= maxoff; i = OffsetNumberNext(i))
      {
          itemid = PageGetItemId(origpage, i);
+
+         /* We can ignore items with LP_DELETE. */
+         if (ItemIdDeleted(itemid))
+             continue;
+
          itemsz = ItemIdGetLength(itemid);
          item = (IndexTuple) PageGetItem(origpage, itemid);

*************** _bt_vacuum_one_page(Relation rel, Buffer
*** 1707,1713 ****
              deletable[ndeletable++] = offnum;
      }

!     if (ndeletable > 0)
          _bt_delitems(rel, buffer, deletable, ndeletable);
      /*
       * Note: if we didn't find any LP_DELETE items, then the page's
--- 1712,1723 ----
              deletable[ndeletable++] = offnum;
      }

!     /*
!      * We delete items if there are at least 2 deletable index tuples
!      * because scanning the page over and over again to get just a little
!      * free space is inefficient.
!      */
!     if (ndeletable >= 2)
          _bt_delitems(rel, buffer, deletable, ndeletable);
      /*
       * Note: if we didn't find any LP_DELETE items, then the page's


Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center



Re: [HACKERS] Resurrecting per-page cleaner for btree

From: Jim Nasby
On Jul 26, 2006, at 4:29 PM, Hannu Krosing wrote:
>>> Well the desire for it comes from a very well established need for dealing
>>> with extremely large tables with relatively small hot spots. The basic problem
>>> being that currently the cost of vacuum is proportional to the size of the
>>> table rather than the amount of dead space. There's no link between those
>>> variables (at least in one direction) and any time they're far out of whack it
>>> means excruciating pain for the DBA.
>>
>> I thought the suggested solution for that was the dead space map. That
>> way vacuum can ignore parts of the table that haven't changed...
>
> It can ignore parts of the *table* but still has to scan full *indexes*.

Even if we stopped right there it would still be a huge win in many
(most?) cases. How often do the indexes on a table comprise even 50%
of the table's size? If indexes are 10% of the table's size, and you
only scan 10% of the table that's dead, you've gone from scanning
1.10*X pages (where X is the number of pages in the table) to 0.20*X
pages. For large tables, that's a drastic improvement. Even in the
50% case, you've gone from 1.5X to .6X (assuming 10% dead space). And
I'm actually ignoring that we currently scan the heap twice; if you
throw that into the mix the argument for doing this is even stronger.

While it would be ideal to eliminate the need to scan the indexes,
I'd certainly rather have the 'half-solution' of not scanning the
entire heap and still scanning the indexes over what we have today.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461



Re: [HACKERS] Resurrecting per-page cleaner for btree

From: Jim Nasby
On Jul 26, 2006, at 10:29 AM, Tom Lane wrote:
> Gregory Stark <gsstark@mit.edu> writes:
>> ... Well it's not like the existing vacuum checks for this.
>
> Right, that's exactly why the patch works at all.  But the point here is
> that the existing vacuum does not rely on re-computing index keys; all
> it cares about is matching TIDs.  The retail-vacuum idea depends on the
> assumption that you can look at the tuple and re-compute the same index
> keys that you computed the first time; which is an assumption much
> shakier than the assumption that TID comparison works.  (In fact, it's
> trivial to see how user-defined functions that are mislabeled immutable
> could make this fail.)  So retail vacuum without any cross-check that
> you got all the index tuples is a scary proposition IMHO.

Are there other cases that could cause a problem besides mislabeled
user function indexes (a case I don't feel we need to protect
against) and bugs in core?

If we don't worry about users who can't figure out what immutable
means, we should be able to cover core bugs by having the buildfarm
occasionally do long-running pg_bench (or some other workload) runs
with autovacuum turned on. Knowing what autovacuum is set to, we can
calculate approximately how much dead/empty space should exist in
indexes, and make sure we're within reason (though this might mean
having a mode that disables deleting known-dead tuples before
splitting a page). Another possibility would be actually inspecting
the indexes for invalid entries.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461



Re: [HACKERS] Resurrecting per-page cleaner for btree

From
Greg Stark
Date:
Jim Nasby <jnasby@pervasive.com> writes:

> Even if we stopped right there it would still be a huge win in many  (most?)
> cases. How often do the indexes on a table comprise even 50%  of the table's
> size?

I would say they're usually roughly comparable, actually. It depends on how
wide your table is, of course, but the wider your table rows, the more
indexes you're likely to have on the table too.

> Even in the  50% case, you've gone from 1.5X to .6X

Sure, and a 3x speedup is nothing to sneeze at, that would be a great
improvement to vacuum. But it's still just a linear speedup and doesn't
address the algorithmic problem.

The fundamental problem is we have a process that's O(m), where m is the total
space taken by a table and its indexes. The actual amount of space it has to
reclaim is n. Other than n < m, there's basically no relationship between these
figures. As long as that's the case, vacuum may as well be O(n^2) or O(n!).
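The point above — that m and n are unrelated — and Jim's "half-solution" arithmetic from earlier in the thread can be made concrete with a toy cost model. This is a minimal sketch; the page counts and the "pages visited" cost unit are hypothetical stand-ins for real I/O cost:

```python
def full_vacuum_cost(heap_pages, index_pages):
    # Today's VACUUM scans the whole heap and every index,
    # regardless of how many tuples are actually dead: O(m).
    return heap_pages + index_pages

def dead_tuple_only_cost(dead_heap_pages, index_pages):
    # The "half-solution": visit only heap pages known to hold
    # dead tuples, but still scan all indexes in full.
    return dead_heap_pages + index_pages

# A big, mostly clean table: 1,000,000 heap pages and 500,000
# index pages, but only 1,000 heap pages contain dead tuples.
m_cost = full_vacuum_cost(1_000_000, 500_000)      # 1,500,000 pages
half_cost = dead_tuple_only_cost(1_000, 500_000)   # 501,000 pages

# ~3x cheaper here, but the index scans now dominate: the cost is
# still driven by total index size, not by the dead space n.
print(m_cost, half_cost)
```

The model shows why the half-solution is a linear win, not an algorithmic one: shrink the dead fraction further and the cost barely moves, because the index-scan term is fixed.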

We frequently assume -- and often it's a valid assumption -- that these
figures are roughly proportional. Hence all the talk about databases reaching
a "steady-state" where the amount of dead space is constantly being reclaimed
at more or less the same speed it's being generated. But there are also plenty
of use cases where a complete vacuum pass takes thousands of times longer than
the i/o it took to generate those dead tuples. Currently Postgres just isn't
that great a tool for those use cases.

Unfortunately while I'm convinced of the problem I'm equally unconvinced of
the solution. I tried to solve online index builds using retail index lookups
in a very similar way to what's being discussed here. And ran into the same
problems. I eventually decided that while it could be made to work that way it
would be far too much code, far too unsafe, and far too invasive in the index
access methods to be the right approach.

Our existing method works with minimal help from the index access methods,
which allows for an enormous degree of freedom in the index design. To be
able to support retail vacuum, you would have to force index method
implementors to keep information in a way that allowed them to look up a
particular value/tid efficiently, which would drastically limit the kinds
of indexes you could support.
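The mislabeled-immutable hazard Tom raised earlier in the thread — a retail vacuumer that recomputes index keys can silently miss entries if the key function's behavior changed between insert and vacuum — can be illustrated with a toy dict-based "index". This is a minimal sketch; the function names and tuple layout are hypothetical, not PostgreSQL APIs:

```python
# Toy "index": maps a key (whatever the supposedly immutable key
# function returned at insert time) to a set of heap TIDs.
index = {}

def index_insert(index, key_fn, heap_value, tid):
    index.setdefault(key_fn(heap_value), set()).add(tid)

def retail_vacuum(index, key_fn, heap_value, tid):
    # Recompute the key from the dead heap tuple and delete the
    # matching index entry, as a retail vacuumer would.
    # Returns True if the entry was actually found.
    entries = index.get(key_fn(heap_value), set())
    if tid in entries:
        entries.discard(tid)
        return True
    return False  # entry orphaned: vacuum silently missed it

# An honestly immutable key function round-trips fine...
index_insert(index, str.lower, "ABC", tid=(1, 1))
assert retail_vacuum(index, str.lower, "ABC", tid=(1, 1))

# ...but a function whose behavior changed between insert and
# vacuum (a mislabeled "immutable") leaves a dangling entry,
# with no cross-check to catch it.
index_insert(index, str.lower, "DEF", tid=(1, 2))
assert not retail_vacuum(index, str.upper, "DEF", tid=(1, 2))
```

TID-based vacuuming sidesteps this entirely: it never recomputes keys, so it only depends on TID comparison working.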

--
greg

Re: [HACKERS] Resurrecting per-page cleaner for btree

From
"Jim C. Nasby"
Date:
On Thu, Jul 27, 2006 at 05:24:35PM -0400, Greg Stark wrote:
>
> Jim Nasby <jnasby@pervasive.com> writes:
>
> > Even if we stopped right there it would still be a huge win in many  (most?)
> > cases. How often do the indexes on a table comprise even 50%  of the table's
> > size?
>
> I would say they're usually roughly comparable actually. It depends on how
> wide your table is of course but the wider your table rows the more indexes
> you're likely to have on the table too.

I think the number of fields in a table will correlate with the number
of indexes, but I don't think the width of those fields matters.

> > Even in the  50% case, you've gone from 1.5X to .6X
>
> Sure, and a 3x speedup is nothing to sneeze at, that would be a great
> improvement to vacuum. But it's still just a linear speedup and doesn't
> address the algorithmic problem.
>
> The fundamental problem is we have a process that's O(m) where m is the total
> space taken by a table and its indexes. The actual amount of space it has to
> reclaim is n. Other than n<m there's basically no relationship between these
> figures. As long as that's the case vacuum may as well be O(n^2) or O(n!).
>
> We frequently assume -- and often it's a valid assumption -- that these
> figures are roughly proportional. Hence all the talk about databases reaching
> a "steady-state" where the amount of dead space is constantly being reclaimed
> at more or less the same speed it's being generated. But there are also plenty
> of use cases where a complete vacuum pass takes thousands of times longer than
> the i/o it took to generate those dead tuples. Currently Postgres just isn't
> that great a tool for those use cases.

This is exactly why I'm suggesting that we stop waiting for the perfect
vacuum that will only hit the exact tuples in both the heap and indexes
that it needs to and at least take the giant leap forward of only
hitting heap tuples/pages that we know are dead. Even if indexes are the
same size as the heap, you've still gained nearly a 2x speed improvement
(depending on how long you wait to vacuum).

> Unfortunately while I'm convinced of the problem I'm equally unconvinced of
> the solution. I tried to solve online index builds using retail index lookups
> in a very similar way to what's being discussed here. And ran into the same
> problems. I eventually decided that while it could be made to work that way it
> would be far too much code, far too unsafe, and far too invasive in the index
> access methods to be the right approach.
>
> Our existing method works with minimal help from the index access methods
> which allows for an enormous degree of freedom in the index design. To be
> able to support retail vacuum you would have to force index method
> implementors to keep information in a way that allowed them to look up a
> particular value/tid efficiently which would limit the kinds of indexes you
> could support drastically.
>
> --
> greg
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster
>

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: [HACKERS] Resurrecting per-page cleaner for btree

From
Bruce Momjian
Date:
Have we made a decision on this issue?  Should I apply my patch that
still forces a split unless 10% of the page has been freed?

---------------------------------------------------------------------------

Tom Lane wrote:
> ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> > This is a revised patch originated by Junji TERAMOTO for HEAD.
> >   [BTree vacuum before page splitting]
> >   http://archives.postgresql.org/pgsql-patches/2006-01/msg00301.php
> > I think we can resurrect his idea because we will scan btree pages
> > a page at a time now; the missing-restarting-point problem went away.
>
> I've applied this but I'm now having some second thoughts about it,
> because I'm seeing an actual *decrease* in pgbench numbers from the
> immediately prior CVS HEAD code.  Using
>     pgbench -i -s 10 bench
>     pgbench -c 10 -t 1000 bench    (repeat this half a dozen times)
> with fsync off but all other settings factory-stock, what I'm seeing
> is that the first run looks really good but subsequent runs tail off in
> spectacular fashion :-(  Pre-patch there was only minor degradation in
> successive runs.
>
> What I think is happening is that because pgbench depends so heavily on
> updating existing records, we get into a state where an index page is
> about full and there's one dead tuple on it, and then for each insertion
> we have
>
>     * check for uniqueness marks one more tuple dead (the
>       next-to-last version of the tuple)
>     * newly added code removes one tuple and does a write
>     * now there's enough room to insert one tuple
>     * lather, rinse, repeat, never splitting the page.
>
> The problem is that we've traded splitting a page every few hundred
> inserts for doing a PageIndexMultiDelete, and emitting an extra WAL
> record, on *every* insert.  This is not good.
>
> Had you done any performance testing on this patch, and if so what
> tests did you use?  I'm a bit hesitant to try to fix it on the basis
> of pgbench results alone.
>
> One possible fix that comes to mind is to only perform the cleanup
> if we are able to remove more than one dead tuple (perhaps about 10
> would be good).  Or do the deletion anyway, but then go ahead and
> split the page unless X amount of space has been freed (where X is
> more than just barely enough for the incoming tuple).
>
> After all the thought we've put into this, it seems a shame to
> just abandon it :-(.  But it definitely needs more tweaking.
>
>             regards, tom lane

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
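The degenerate loop Tom describes in the quoted message — one dead tuple reclaimed, one extra WAL record emitted, per insert, with the page never splitting — and the effect of Bruce's require-10%-free variant can be sketched as a small simulation. This is a toy model, not the real btree code; the page capacity and accounting are illustrative:

```python
def run_inserts(n_inserts, capacity=200, min_free_frac=0.0):
    """Each pgbench-style insert first marks one older tuple version
    dead (the uniqueness check), then tries to place the new tuple."""
    live, dead = capacity - 1, 1      # start with a nearly full page
    cleanups = splits = 0
    for _ in range(n_inserts):
        if live + dead + 1 > capacity:          # page looks full
            freed = dead                        # single-page cleanup
            dead = 0                            # (and one WAL record)
            cleanups += 1
            # The 10%-free variant: if cleanup freed less than the
            # threshold, split anyway instead of squeaking by.
            if freed < min_free_frac * capacity:
                splits += 1
                live //= 2                      # half moves to a new page
        live += 1                               # insert the new version
        live -= 1                               # ...an old version dies
        dead += 1
    return cleanups, splits

# Stock behavior: a cleanup (and its WAL record) on *every* insert,
# and the page never splits.
c0, s0 = run_inserts(1000)
# 10%-free variant: one split restores headroom, after which
# cleanups become rare (roughly one per hundred inserts here).
c1, s1 = run_inserts(1000, min_free_frac=0.10)
print(c0, s0, c1, s1)  # → 1000 0 10 1
```

The simulation matches the diagnosis: trading one page split per few hundred inserts for a PageIndexMultiDelete on every insert is a bad bargain, and a minimum-freed-space threshold restores the amortization.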

Re: [HACKERS] Resurrecting per-page cleaner for btree

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Have we made a decision on this issue?  Should I apply my patch that
> still forces a split unless 10% of the page has been freed?

I haven't gotten back to doing any more performance testing.  Please
stick that patch on the pending queue, so we don't forget it, but
don't apply it yet ...

            regards, tom lane

Re: [HACKERS] Resurrecting per-page cleaner for

From
Bruce Momjian
Date:
Tom, should I apply this patch now?  Are you still considering other
options for this?

---------------------------------------------------------------------------

Bruce Momjian wrote:
>
> Tom, I ran your tests with fsync off (as you did), and saw numbers
> bouncing between 400-700 tps without my patch, and sticking at 700 tps
> with my patch.
>
> ---------------------------------------------------------------------------
>
> Bruce Momjian wrote:
> >
> > The attached patch requires the new row to fit, and 10% to be free on
> > the page.  Would someone test that?
> >
> > ---------------------------------------------------------------------------
> >
> > Tom Lane wrote:
> > > ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> > > > This is a revised patch originated by Junji TERAMOTO for HEAD.
> > > >   [BTree vacuum before page splitting]
> > > >   http://archives.postgresql.org/pgsql-patches/2006-01/msg00301.php
> > > > I think we can resurrect his idea because we will scan btree pages
> > > > a page at a time now; the missing-restarting-point problem went away.
> > >
> > > I've applied this but I'm now having some second thoughts about it,
> > > because I'm seeing an actual *decrease* in pgbench numbers from the
> > > immediately prior CVS HEAD code.  Using
> > >     pgbench -i -s 10 bench
> > >     pgbench -c 10 -t 1000 bench    (repeat this half a dozen times)
> > > with fsync off but all other settings factory-stock, what I'm seeing
> > > is that the first run looks really good but subsequent runs tail off in
> > > spectacular fashion :-(  Pre-patch there was only minor degradation in
> > > successive runs.
> > >
> > > What I think is happening is that because pgbench depends so heavily on
> > > updating existing records, we get into a state where an index page is
> > > about full and there's one dead tuple on it, and then for each insertion
> > > we have
> > >
> > >     * check for uniqueness marks one more tuple dead (the
> > >       next-to-last version of the tuple)
> > >     * newly added code removes one tuple and does a write
> > >     * now there's enough room to insert one tuple
> > >     * lather, rinse, repeat, never splitting the page.
> > >
> > > The problem is that we've traded splitting a page every few hundred
> > > inserts for doing a PageIndexMultiDelete, and emitting an extra WAL
> > > record, on *every* insert.  This is not good.
> > >
> > > Had you done any performance testing on this patch, and if so what
> > > tests did you use?  I'm a bit hesitant to try to fix it on the basis
> > > of pgbench results alone.
> > >
> > > One possible fix that comes to mind is to only perform the cleanup
> > > if we are able to remove more than one dead tuple (perhaps about 10
> > > would be good).  Or do the deletion anyway, but then go ahead and
> > > split the page unless X amount of space has been freed (where X is
> > > more than just barely enough for the incoming tuple).
> > >
> > > After all the thought we've put into this, it seems a shame to
> > > just abandon it :-(.  But it definitely needs more tweaking.
> > >
> > >             regards, tom lane
> >
> > --
> >   Bruce Momjian   bruce@momjian.us
> >   EnterpriseDB    http://www.enterprisedb.com
> >
> >   + If your life is a hard drive, Christ can be your backup. +
>
>
>
> --
>   Bruce Momjian   bruce@momjian.us
>   EnterpriseDB    http://www.enterprisedb.com
>
>   + If your life is a hard drive, Christ can be your backup. +

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: [HACKERS] Resurrecting per-page cleaner for

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Tom, should I apply this patch now?  Are you still considering other
> options for this?

Please wait.  This issue is very far down the to-do list in terms of size
or significance ... but I'll get to it.

            regards, tom lane

Re: [HACKERS] Resurrecting per-page cleaner for btree

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Tom Lane wrote:
>> I've applied this but I'm now having some second thoughts about it,
>> because I'm seeing an actual *decrease* in pgbench numbers from the
>> immediately prior CVS HEAD code.

> The attached patch requires the new row to fit, and 10% to be free on
> the page.  Would someone test that?

At the moment, I cannot replicate any consistent difference between
CVS head with the patch, without the patch, with the patch plus your
BLCKSZ/10 limit addition, or with a variant BLCKSZ/32 limit addition.
That's whether I use HEAD's broken version of pgbench or one from late
July.  So I'm feeling a tad frustrated ... but I have no evidence in
favor of changing what is in CVS, and accordingly recommend that we
leave well enough alone for 8.2.

            regards, tom lane

Re: [HACKERS] Resurrecting per-page cleaner for

From
Bruce Momjian
Date:
OK, removed from open item list.

---------------------------------------------------------------------------

Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Tom Lane wrote:
> >> I've applied this but I'm now having some second thoughts about it,
> >> because I'm seeing an actual *decrease* in pgbench numbers from the
> >> immediately prior CVS HEAD code.
>
> > The attached patch requires the new row to fit, and 10% to be free on
> > the page.  Would someone test that?
>
> At the moment, I cannot replicate any consistent difference between
> CVS head with the patch, without the patch, with the patch plus your
> BLCKSZ/10 limit addition, or with a variant BLCKSZ/32 limit addition.
> That's whether I use HEAD's broken version of pgbench or one from late
> July.  So I'm feeling a tad frustrated ... but I have no evidence in
> favor of changing what is in CVS, and accordingly recommend that we
> leave well enough alone for 8.2.
>
>             regards, tom lane

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +