Thread: Resurrecting per-page cleaner for btree
This is a revised patch originated by Junji TERAMOTO for HEAD.
[BTree vacuum before page splitting]
http://archives.postgresql.org/pgsql-patches/2006-01/msg00301.php

I think we can resurrect his idea because we will scan btree pages one page at a time now; the missing-restarting-point problem went away.

Have I missed something? Comments welcome.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
Attachment
On Thu, 2006-07-13 at 10:49 +0900, ITAGAKI Takahiro wrote: > This is a revised patch originated by Junji TERAMOTO for HEAD. > [BTree vacuum before page splitting] > http://archives.postgresql.org/pgsql-patches/2006-01/msg00301.php > > I think we can resurrect his idea because we will scan btree pages > at-atime now; the missing-restarting-point problem went away. > > Have I missed something? Comments welcome. The new locking system does allow this now. So this is an option for us, AFAICS. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> This is a revised patch originated by Junji TERAMOTO for HEAD.
> [BTree vacuum before page splitting]
> http://archives.postgresql.org/pgsql-patches/2006-01/msg00301.php
> I think we can resurrect his idea because we will scan btree pages
> one page at a time now; the missing-restarting-point problem went away.
> Have I missed something? Comments welcome.

I think the only serious objection to this would be that it'd mean that tuples that should have an index entry might not have one. The current form of VACUUM does not care, but people keep raising the idea of doing "retail" vacuuming that operates by looking up index entries explicitly. You could certainly make a retail vacuumer do nothing if it fails to find the expected index entry, but ISTM that'd be a rather serious loss of consistency checking --- you could not tell the someone-already-deleted-it case apart from a bug in the vacuumer's index value computation or lookup.

Personally I don't think retail vacuuming in that form will ever fly anyway, so I have no problem with installing the proposed patch, but I thought I'd better throw this comment out to see if anyone thinks it's a big deal.

			regards, tom lane
On Mon, 24 Jul 2006, Tom Lane wrote: > Personally I don't think retail vacuuming in that form will ever fly > anyway, so I have no problem with installing the proposed patch, > but I thought I'd better throw this comment out to see if anyone > thinks it's a big deal. My feeling is that retail vacuuming would be useful some day. But it's certainly not going to be there in 8.2 so I have no objection with the patch. It's a fairly localized change; it can easily be reverted later if necessary. - Heikki
Tom Lane wrote: > ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes: > > This is a revised patch originated by Junji TERAMOTO for HEAD. > > [BTree vacuum before page splitting] > > http://archives.postgresql.org/pgsql-patches/2006-01/msg00301.php > > I think we can resurrect his idea because we will scan btree pages > > at-atime now; the missing-restarting-point problem went away. > > Have I missed something? Comments welcome. > > I think the only serious objection to this would be that it'd mean that > tuples that should have an index entry might not have one. The current > form of VACUUM does not care, but people keep raising the idea of doing > "retail" vacuuming that operates by looking up index entries explicitly. > You could certainly make a retail vacuumer do nothing if it fails to > find the expected index entry, but ISTM that'd be a rather serious loss > of consistency checking --- you could not tell the someone-already- > deleted-it case apart from a bug in the vacuumer's index value > computation or lookup. > > Personally I don't think retail vacuuming in that form will ever fly > anyway, so I have no problem with installing the proposed patch, > but I thought I'd better throw this comment out to see if anyone > thinks it's a big deal. Agreed. Reverse lookup of index entries will always be too slow. -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Your patch has been added to the PostgreSQL unapplied patches list at: http://momjian.postgresql.org/cgi-bin/pgpatches It will be applied as soon as one of the PostgreSQL committers reviews and approves it. --------------------------------------------------------------------------- ITAGAKI Takahiro wrote: > This is a revised patch originated by Junji TERAMOTO for HEAD. > [BTree vacuum before page splitting] > http://archives.postgresql.org/pgsql-patches/2006-01/msg00301.php > > I think we can resurrect his idea because we will scan btree pages > at-atime now; the missing-restarting-point problem went away. > > Have I missed something? Comments welcome. > > Regards, > --- > ITAGAKI Takahiro > NTT Open Source Software Center > [ Attachment, skipping... ] > > ---------------------------(end of broadcast)--------------------------- > TIP 2: Don't 'kill -9' the postmaster -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> I think we can resurrect his idea because we will scan btree pages
> one page at a time now; the missing-restarting-point problem went away.
> Have I missed something? Comments welcome.

I was thinking for awhile just now that this would break the interlock that guarantees VACUUM can't delete a heap tuple that an indexscanning process is about to visit. After further thought, it doesn't, but it's non-obvious. I've added the attached commentary to nbtree/README:

On-the-fly deletion of index tuples
-----------------------------------

If a process visits a heap tuple and finds that it's dead and removable (ie, dead to all open transactions, not only that process), then we can return to the index and mark the corresponding index entry "known dead", allowing subsequent index scans to skip visiting the heap tuple. The "known dead" marking uses the LP_DELETE bit in ItemIds. This is currently only done in plain indexscans, not bitmap scans, because only plain scans visit the heap and index "in sync" and so there's not a convenient way to do it for bitmap scans.

Once an index tuple has been marked LP_DELETE it can actually be removed from the index immediately; since index scans only stop "between" pages, no scan can lose its place from such a deletion. We separate the steps because we allow LP_DELETE to be set with only a share lock (it's exactly like a hint bit for a heap tuple), but physically removing tuples requires exclusive lock. In the current code we try to remove LP_DELETE tuples when we are otherwise faced with having to split a page to do an insertion (and hence have exclusive lock on it already).

This leaves the index in a state where it has no entry for a dead tuple that still exists in the heap. This is not a problem for the current implementation of VACUUM, but it could be a problem for anything that explicitly tries to find index entries for dead tuples.
(However, the same situation is created by REINDEX, since it doesn't enter dead tuples into the index.)

It's sufficient to have an exclusive lock on the index page, not a super-exclusive lock, to do deletion of LP_DELETE items. It might seem that this breaks the interlock between VACUUM and indexscans, but that is not so: as long as an indexscanning process has a pin on the page where the index item used to be, VACUUM cannot complete its btbulkdelete scan and so cannot remove the heap tuple. This is another reason why btbulkdelete has to get super-exclusive lock on every leaf page, not only the ones where it actually sees items to delete.

			regards, tom lane
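The mark-then-remove protocol in the README excerpt above can be illustrated with a toy model. Everything below (ToyPage, ToyItem, TOY_PAGE_CAP) is hypothetical scaffolding, not the real nbtree structures; it only shows the two-step ordering: the LP_DELETE hint may be set while holding only a share lock, and physical removal happens later, under the exclusive lock we already hold when we are about to split.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PAGE_CAP 8          /* made-up leaf-page capacity */

typedef struct ToyItem
{
    int  heap_tid;              /* stand-in for the heap TID */
    bool lp_delete;             /* the "known dead" hint bit */
} ToyItem;

typedef struct ToyPage
{
    ToyItem items[TOY_PAGE_CAP];
    int     nitems;
} ToyPage;

/*
 * Step 1: mark an entry "known dead".  Like a hint bit, this is safe
 * under a mere share lock; scans simply start skipping the heap visit.
 */
static void
toy_mark_known_dead(ToyPage *page, int heap_tid)
{
    for (int i = 0; i < page->nitems; i++)
        if (page->items[i].heap_tid == heap_tid)
            page->items[i].lp_delete = true;
}

/*
 * Step 2: physically remove all LP_DELETE items.  This compacts the
 * page and needs an exclusive lock; in nbtree it is attempted when an
 * insertion would otherwise have to split the page.  Returns the
 * number of items reclaimed.
 */
static int
toy_vacuum_one_page(ToyPage *page)
{
    int keep = 0, removed = 0;

    for (int i = 0; i < page->nitems; i++)
    {
        if (page->items[i].lp_delete)
            removed++;
        else
            page->items[keep++] = page->items[i];
    }
    page->nitems = keep;
    return removed;
}
```

After step 2 the index has no entry for heap tuples that are still physically present, which is exactly the state the README flags as safe for today's VACUUM but problematic for any retail scheme.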
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> This is a revised patch originated by Junji TERAMOTO for HEAD.
> [BTree vacuum before page splitting]
> http://archives.postgresql.org/pgsql-patches/2006-01/msg00301.php
> I think we can resurrect his idea because we will scan btree pages
> one page at a time now; the missing-restarting-point problem went away.

I've applied this but I'm now having some second thoughts about it, because I'm seeing an actual *decrease* in pgbench numbers from the immediately prior CVS HEAD code. Using

	pgbench -i -s 10 bench
	pgbench -c 10 -t 1000 bench	(repeat this half a dozen times)

with fsync off but all other settings factory-stock, what I'm seeing is that the first run looks really good but subsequent runs tail off in spectacular fashion :-( Pre-patch there was only minor degradation in successive runs.

What I think is happening is that because pgbench depends so heavily on updating existing records, we get into a state where an index page is about full and there's one dead tuple on it, and then for each insertion we have

	* check for uniqueness marks one more tuple dead (the
	  next-to-last version of the tuple)
	* newly added code removes one tuple and does a write
	* now there's enough room to insert one tuple
	* lather, rinse, repeat, never splitting the page.

The problem is that we've traded splitting a page every few hundred inserts for doing a PageIndexMultiDelete, and emitting an extra WAL record, on *every* insert. This is not good.

Had you done any performance testing on this patch, and if so what tests did you use? I'm a bit hesitant to try to fix it on the basis of pgbench results alone.

One possible fix that comes to mind is to only perform the cleanup if we are able to remove more than one dead tuple (perhaps about 10 would be good). Or do the deletion anyway, but then go ahead and split the page unless X amount of space has been freed (where X is more than just barely enough for the incoming tuple).
After all the thought we've put into this, it seems a shame to just abandon it :-(. But it definitely needs more tweaking.

			regards, tom lane
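The degenerate cycle described above can be sketched numerically. The model below is a deliberate simplification (one page with a made-up slot capacity; each pgbench-style update marks one old version dead, then needs one slot): a page that sits exactly full pays one cleanup, and hence one extra WAL record, per insert, while a page with post-split slack amortizes one cleanup over many inserts.

```c
#include <assert.h>

#define CAP 200                 /* made-up number of item slots per leaf page */

/*
 * Pathological case: the page starts exactly full.  Each update
 * (a) marks the previous tuple version LP_DELETE during the
 * uniqueness check, then (b) needs one free slot for the new version.
 * Cleanup runs whenever the page is full, so no split ever happens --
 * but every single insert pays for a cleanup write.
 */
static int
cleanups_when_page_stays_full(int nupdates)
{
    int live = CAP, dead = 0, cleanups = 0;

    for (int i = 0; i < nupdates; i++)
    {
        live--; dead++;             /* old version marked dead */
        if (live + dead >= CAP)     /* full: clean instead of splitting */
        {
            dead = 0;               /* dead items physically removed */
            cleanups++;
        }
        live++;                     /* new version inserted */
    }
    return cleanups;
}

/*
 * For contrast: a page that just split and has about half its slots
 * free.  Dead versions accumulate in the slack, so one cleanup now
 * services on the order of CAP/2 inserts instead of one.
 */
static int
cleanups_after_split(int nupdates)
{
    int live = CAP / 2, dead = 0, cleanups = 0;

    for (int i = 0; i < nupdates; i++)
    {
        live--; dead++;
        if (live + dead >= CAP)
        {
            dead = 0;
            cleanups++;
        }
        live++;
    }
    return cleanups;
}
```

This is only an illustration of the amortization argument, not a model of real WAL volume; it just shows why "split unless X space was freed" restores slack and breaks the one-cleanup-per-insert loop.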
The attached patch requires the new row to fit, and 10% to be free on the page. Would someone test that?

---------------------------------------------------------------------------

Tom Lane wrote:
> ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> > This is a revised patch originated by Junji TERAMOTO for HEAD.
> > [BTree vacuum before page splitting]
> > http://archives.postgresql.org/pgsql-patches/2006-01/msg00301.php
> > I think we can resurrect his idea because we will scan btree pages
> > one page at a time now; the missing-restarting-point problem went away.
>
> I've applied this but I'm now having some second thoughts about it,
> because I'm seeing an actual *decrease* in pgbench numbers from the
> immediately prior CVS HEAD code. Using
> 	pgbench -i -s 10 bench
> 	pgbench -c 10 -t 1000 bench	(repeat this half a dozen times)
> with fsync off but all other settings factory-stock, what I'm seeing
> is that the first run looks really good but subsequent runs tail off in
> spectacular fashion :-( Pre-patch there was only minor degradation in
> successive runs.
>
> What I think is happening is that because pgbench depends so heavily on
> updating existing records, we get into a state where an index page is
> about full and there's one dead tuple on it, and then for each insertion
> we have
>
> 	* check for uniqueness marks one more tuple dead (the
> 	  next-to-last version of the tuple)
> 	* newly added code removes one tuple and does a write
> 	* now there's enough room to insert one tuple
> 	* lather, rinse, repeat, never splitting the page.
>
> The problem is that we've traded splitting a page every few hundred
> inserts for doing a PageIndexMultiDelete, and emitting an extra WAL
> record, on *every* insert. This is not good.
>
> Had you done any performance testing on this patch, and if so what
> tests did you use? I'm a bit hesitant to try to fix it on the basis
> of pgbench results alone.
>
> One possible fix that comes to mind is to only perform the cleanup
> if we are able to remove more than one dead tuple (perhaps about 10
> would be good). Or do the deletion anyway, but then go ahead and
> split the page unless X amount of space has been freed (where X is
> more than just barely enough for the incoming tuple).
>
> After all the thought we've put into this, it seems a shame to
> just abandon it :-(. But it definitely needs more tweaking.
>
> 			regards, tom lane

--
Bruce Momjian   bruce@momjian.us
EnterpriseDB    http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

Index: src/backend/access/nbtree/nbtinsert.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/access/nbtree/nbtinsert.c,v
retrieving revision 1.142
diff -c -c -r1.142 nbtinsert.c
*** src/backend/access/nbtree/nbtinsert.c	25 Jul 2006 19:13:00 -0000	1.142
--- src/backend/access/nbtree/nbtinsert.c	26 Jul 2006 01:35:52 -0000
***************
*** 438,445 ****
  		if (P_ISLEAF(lpageop) && P_HAS_GARBAGE(lpageop))
  		{
  			_bt_vacuum_one_page(rel, buf);
! 			if (PageGetFreeSpace(page) >= itemsz)
! 				break;			/* OK, now we have enough space */
  		}
  
  		/*
--- 438,451 ----
  		if (P_ISLEAF(lpageop) && P_HAS_GARBAGE(lpageop))
  		{
  			_bt_vacuum_one_page(rel, buf);
! 			/*
! 			 * Free space should be large enough for the new tuple and
! 			 * should be >= 10% because scanning the page over and
! 			 * over again to get just a little free space is inefficient.
! 			 */
! 			if (PageGetFreeSpace(page) >= itemsz &&
! 				PageGetFreeSpace(page) >= BLCKSZ / 10)
! 				break;
  		}
  
  		/*
Tom, I ran your tests with fsync off (as you did), and saw numbers bouncing between 400-700 tps without my patch, and sticking at 700 tps with my patch. --------------------------------------------------------------------------- Bruce Momjian wrote: > > The attached patch requires the new row to fit, and 10% to be free on > the page. Would someone test that? > > --------------------------------------------------------------------------- > > Tom Lane wrote: > > ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes: > > > This is a revised patch originated by Junji TERAMOTO for HEAD. > > > [BTree vacuum before page splitting] > > > http://archives.postgresql.org/pgsql-patches/2006-01/msg00301.php > > > I think we can resurrect his idea because we will scan btree pages > > > at-atime now; the missing-restarting-point problem went away. > > > > I've applied this but I'm now having some second thoughts about it, > > because I'm seeing an actual *decrease* in pgbench numbers from the > > immediately prior CVS HEAD code. Using > > pgbench -i -s 10 bench > > pgbench -c 10 -t 1000 bench (repeat this half a dozen times) > > with fsync off but all other settings factory-stock, what I'm seeing > > is that the first run looks really good but subsequent runs tail off in > > spectacular fashion :-( Pre-patch there was only minor degradation in > > successive runs. > > > > What I think is happening is that because pgbench depends so heavily on > > updating existing records, we get into a state where an index page is > > about full and there's one dead tuple on it, and then for each insertion > > we have > > > > * check for uniqueness marks one more tuple dead (the > > next-to-last version of the tuple) > > * newly added code removes one tuple and does a write > > * now there's enough room to insert one tuple > > * lather, rinse, repeat, never splitting the page. 
> > > > The problem is that we've traded splitting a page every few hundred > > inserts for doing a PageIndexMultiDelete, and emitting an extra WAL > > record, on *every* insert. This is not good. > > > > Had you done any performance testing on this patch, and if so what > > tests did you use? I'm a bit hesitant to try to fix it on the basis > > of pgbench results alone. > > > > One possible fix that comes to mind is to only perform the cleanup > > if we are able to remove more than one dead tuple (perhaps about 10 > > would be good). Or do the deletion anyway, but then go ahead and > > split the page unless X amount of space has been freed (where X is > > more than just barely enough for the incoming tuple). > > > > After all the thought we've put into this, it seems a shame to > > just abandon it :-(. But it definitely needs more tweaking. > > > > regards, tom lane > > > > ---------------------------(end of broadcast)--------------------------- > > TIP 4: Have you searched our list archives? > > > > http://archives.postgresql.org > > -- > Bruce Momjian bruce@momjian.us > EnterpriseDB http://www.enterprisedb.com > > + If your life is a hard drive, Christ can be your backup. + > > ---------------------------(end of broadcast)--------------------------- > TIP 2: Don't 'kill -9' the postmaster -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Tom Lane <tgl@sss.pgh.pa.us> writes:
> ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> > This is a revised patch originated by Junji TERAMOTO for HEAD.
> > [BTree vacuum before page splitting]
> > http://archives.postgresql.org/pgsql-patches/2006-01/msg00301.php
> > I think we can resurrect his idea because we will scan btree pages
> > one page at a time now; the missing-restarting-point problem went away.
> > Have I missed something? Comments welcome.
>
> I think the only serious objection to this would be that it'd mean that
> tuples that should have an index entry might not have one. The current
> form of VACUUM does not care, but people keep raising the idea of doing
> "retail" vacuuming that operates by looking up index entries explicitly.
> You could certainly make a retail vacuumer do nothing if it fails to
> find the expected index entry, but ISTM that'd be a rather serious loss
> of consistency checking --- you could not tell the someone-already-
> deleted-it case apart from a bug in the vacuumer's index value
> computation or lookup.

Well, you already have that case anyway due to online index builds (if a vacuum gets in between the two transactions). I suppose you could go and check whether the pg_index entry for the index indicates that it's valid already, but that's not something anyone has to be looking for currently.

> Personally I don't think retail vacuuming in that form will ever fly
> anyway, so I have no problem with installing the proposed patch,
> but I thought I'd better throw this comment out to see if anyone
> thinks it's a big deal.

Well, it's not like the existing vacuum checks for this. It's not even convenient to check for it in the current vacuum architecture. It would have to maintain the list of expired htids that haven't been found yet in the bulkdelete and see if there are any left when it's done.

--
Gregory Stark
EnterpriseDB   http://www.enterprisedb.com
Gregory Stark <gsstark@mit.edu> writes:
> ... Well it's not like the existing vacuum checks for this.

Right, that's exactly why the patch works at all. But the point here is that the existing vacuum does not rely on re-computing index keys; all it cares about is matching TIDs. The retail-vacuum idea depends on the assumption that you can look at the tuple and re-compute the same index keys that you computed the first time, which is an assumption much shakier than the assumption that TID comparison works. (In fact, it's trivial to see how user-defined functions that are mislabeled immutable could make this fail.) So retail vacuum without any cross-check that you got all the index tuples is a scary proposition IMHO.

			regards, tom lane
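The mislabeled-immutable hazard is easy to demonstrate with a toy functional index. Everything here is hypothetical (the function, the array "index"); the point is only that a key function which is declared immutable but secretly stateful makes the key recomputed at retail-vacuum time differ from the key stored at insert time, so a failed lookup is indistinguishable from an already-deleted entry.

```c
#include <assert.h>
#include <stdbool.h>

#define NROWS 100

static int ncalls = 0;

/*
 * Hypothetical user-defined key function, labeled "immutable" but
 * secretly stateful -- think of one that consults a setting, the
 * current date, or another table.  Here the hidden state is just a
 * call counter: it changes behavior after the index is built.
 */
static int
mislabeled_immutable_key(int value)
{
    ncalls++;
    return (ncalls <= NROWS) ? value * 2 : value * 2 + 1;
}

static int index_keys[NROWS];

/* Build the "functional index": keys computed at insert time. */
static void
build_index(void)
{
    for (int v = 0; v < NROWS; v++)
        index_keys[v] = mislabeled_immutable_key(v);
}

/*
 * Retail vacuum of one heap value: recompute the key and search for
 * the matching entry.  If the recomputed key differs from the stored
 * one, the lookup fails -- and the caller cannot tell this bug apart
 * from the legitimate someone-already-deleted-it case.
 */
static bool
retail_find(int value)
{
    int key = mislabeled_immutable_key(value);

    for (int i = 0; i < NROWS; i++)
        if (index_keys[i] == key)
            return true;
    return false;
}
```

TID matching sidesteps this entirely, which is why the existing bulkdelete approach doesn't care whether index keys are recomputable.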
> [snip] (In fact, it's
> trivial to see how user-defined functions that are mislabeled immutable
> could make this fail.) So retail vacuum without any cross-check that
> you got all the index tuples is a scary proposition IMHO.

Wouldn't it work to restrict that kind of vacuum to tables whose indexes use no user-defined functions? That would be a very small restriction, I guess; probably 99.9% of indexes don't use user-defined functions...

I actually wonder if such a vacuum would be useful for my scenario, where I have some pretty big tables and update a relatively small percentage of them. Would such a vacuum run faster than the current one? One example would be a ~100 million row table where I have 1-4 million updates per day. Could I run vacuum multiple times a day for this table and expect that individual runs are relatively fast?

Cheers,
Csaba.
Csaba Nagy <nagy@ecircle-ag.com> writes: >> [snip] (In fact, it's >> trivial to see how user-defined functions that are mislabeled immutable >> could make this fail.) So retail vacuum without any cross-check that >> you got all the index tuples is a scary proposition IMHO. > Wouldn't work to restrict that kind of vacuum to only tables which have > no indexes using user defined functions ? Of course, we never have bugs in PG core. Nope, doesn't happen ... > I actually wonder if such a vacuum would be useful for my scenario, > where I have some pretty big tables, and update a relatively small > percentage of it. Would it be faster to run such a vacuum against the > current one ? So far, the case hasn't been made for retail vacuum even ignoring the not-so-immutable-function risk. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
> So far, the case hasn't been made for retail vacuum even ignoring the
> not-so-immutable-function risk.

Well, the desire for it comes from a very well established need for dealing with extremely large tables with relatively small hot spots. The basic problem is that currently the cost of vacuum is proportional to the size of the table rather than the amount of dead space. There's no link between those variables (at least in one direction), and any time they're far out of whack it means excruciating pain for the DBA.

The case that perhaps hasn't been made is whether retail vacuum is the best solution. The only other candidate that I've seen proposed (many, many times) is some form of segregating the older tuples a la Oracle.

That doesn't mean retail vacuuming is a good idea. It may just be that we haven't thought of the best option yet. But it also may be that retail vacuuming or some kind of rollback segment is just the least bad idea.

--
greg
On Wed, Jul 26, 2006 at 12:47:57PM -0400, Greg Stark wrote:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>
> > So far, the case hasn't been made for retail vacuum even ignoring the
> > not-so-immutable-function risk.
>
> Well the desire for it comes from a very well established need for dealing
> with extremely large tables with relatively small hot spots. The basic problem
> being that currently the cost of vacuum is proportional to the size of the
> table rather than the amount of dead space. There's no link between those
> variables (at least in one direction) and any time they're far out of whack it
> means excruciating pain for the DBA.

I thought the suggested solution for that was the dead space map. That way vacuum can ignore parts of the table that haven't changed...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
On Wed, 2006-07-26 at 23:02, Martijn van Oosterhout wrote:
> On Wed, Jul 26, 2006 at 12:47:57PM -0400, Greg Stark wrote:
> > Tom Lane <tgl@sss.pgh.pa.us> writes:
> >
> > > So far, the case hasn't been made for retail vacuum even ignoring the
> > > not-so-immutable-function risk.
> >
> > Well the desire for it comes from a very well established need for dealing
> > with extremely large tables with relatively small hot spots. The basic problem
> > being that currently the cost of vacuum is proportional to the size of the
> > table rather than the amount of dead space. There's no link between those
> > variables (at least in one direction) and any time they're far out of whack it
> > means excruciating pain for the DBA.
>
> I thought the suggested solution for that was the dead space map. That
> way vacuum can ignore parts of the table that haven't changed...

It can ignore parts of the *table* but still has to scan full *indexes*.

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com
Bruce Momjian <bruce@momjian.us> wrote:
> The attached patch requires the new row to fit, and 10% to be free on
> the page. Would someone test that?

This is another solution for the same purpose. We can avoid calling PageIndexMultiDelete() to remove only one tuple. _bt_split() now ignores tuples marked LP_DELETE, so such tuples are discarded when the page is split. The cleaner threshold (currently 2) might need further tweaking.

Can this change resolve your performance problem?

diff -cpr pgsql-orig/src/backend/access/nbtree/nbtinsert.c pgsql/src/backend/access/nbtree/nbtinsert.c
*** pgsql-orig/src/backend/access/nbtree/nbtinsert.c	Wed Jul 26 11:15:20 2006
--- pgsql/src/backend/access/nbtree/nbtinsert.c	Thu Jul 27 18:27:25 2006
*************** _bt_split(Relation rel, Buffer buf, Offs
*** 788,793 ****
--- 788,798 ----
  	for (i = P_FIRSTDATAKEY(oopaque); i <= maxoff; i = OffsetNumberNext(i))
  	{
  		itemid = PageGetItemId(origpage, i);
+ 
+ 		/* We can ignore items with LP_DELETE. */
+ 		if (ItemIdDeleted(itemid))
+ 			continue;
+ 
  		itemsz = ItemIdGetLength(itemid);
  		item = (IndexTuple) PageGetItem(origpage, itemid);
*************** _bt_vacuum_one_page(Relation rel, Buffer
*** 1707,1713 ****
  		deletable[ndeletable++] = offnum;
  	}
  
! 	if (ndeletable > 0)
  		_bt_delitems(rel, buffer, deletable, ndeletable);
  
  	/*
  	 * Note: if we didn't find any LP_DELETE items, then the page's
--- 1712,1723 ----
  		deletable[ndeletable++] = offnum;
  	}
  
! 	/*
! 	 * We delete items if there are at least 2 deletable index tuples
! 	 * because scanning the page over and over again to get just a little
! 	 * free space is inefficient.
! 	 */
! 	if (ndeletable >= 2)
  		_bt_delitems(rel, buffer, deletable, ndeletable);
  
  	/*
  	 * Note: if we didn't find any LP_DELETE items, then the page's

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
On Jul 26, 2006, at 4:29 PM, Hannu Krosing wrote:
>>> Well the desire for it comes from a very well established need for dealing
>>> with extremely large tables with relatively small hot spots. The basic problem
>>> being that currently the cost of vacuum is proportional to the size of the
>>> table rather than the amount of dead space. There's no link between those
>>> variables (at least in one direction) and any time they're far out of whack it
>>> means excruciating pain for the DBA.
>>
>> I thought the suggested solution for that was the dead space map. That
>> way vacuum can ignore parts of the table that haven't changed...
>
> It can ignore parts of the *table* but still has to scan full *indexes*.

Even if we stopped right there it would still be a huge win in many (most?) cases. How often do the indexes on a table comprise even 50% of the table's size? If indexes are 10% of the table's size, and you only scan 10% of the table that's dead, you've gone from scanning 1.10*X pages (where X is the number of pages in the table) to 0.20*X pages. For large tables, that's a drastic improvement. Even in the 50% case, you've gone from 1.5X to .6X (assuming 10% dead space). And I'm actually ignoring that we currently scan the heap twice; if you throw that into the mix the argument for doing this is even stronger.

While it would be ideal to eliminate the need to scan the indexes, I'd certainly rather have the 'half-solution' of not scanning the entire heap and still scanning the indexes over what we have today.

--
Jim C. Nasby, Sr. Engineering Consultant   jnasby@pervasive.com
Pervasive Software   http://pervasive.com   work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461
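Jim's arithmetic, under his stated assumptions (a dead space map lets vacuum visit only the dirty fraction of the heap, while every index is still scanned in full, and ignoring the second heap pass), can be checked with a few lines:

```c
#include <assert.h>
#include <math.h>

/*
 * Pages vacuum must scan, as a multiple of the heap size X.
 *   index_frac: total index size relative to the heap
 *   dirty_frac: fraction of the heap containing dead tuples
 * With a dead space map (DSM), vacuum skips the clean heap pages
 * but still scans every index in full.
 */
static double
vacuum_scan_cost(double index_frac, double dirty_frac, int with_dsm)
{
    double heap = with_dsm ? dirty_frac : 1.0;

    return heap + index_frac;
}
```

With indexes at 10% of the heap and 10% dead space this reproduces the 1.10X-versus-0.20X figure, and with indexes at 50% the 1.5X-versus-0.6X figure.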
On Jul 26, 2006, at 10:29 AM, Tom Lane wrote: > Gregory Stark <gsstark@mit.edu> writes: >> ... Well it's not like the existing vacuum checks for this. > > Right, that's exactly why the patch works at all. But the point > here is > that the existing vacuum does not rely on re-computing index keys; all > it cares about is matching TIDs. The retail-vacuum idea depends on > the > assumption that you can look at the tuple and re-compute the same > index > keys that you computed the first time; which is an assumption much > shakier than the assumption that TID comparison works. (In fact, it's > trivial to see how user-defined functions that are mislabeled > immutable > could make this fail.) So retail vacuum without any cross-check that > you got all the index tuples is a scary proposition IMHO. Are there other cases that could cause a problem besides mislabeled user function indexes (a case I don't feel we need to protect against) and bugs in core? If we don't worry about users who can't figure out what immutable means, we should be able to cover core bugs by having the buildfarm occasionally do long-running pg_bench (or some other workload) runs with autovacuum turned on. Knowing what autovacuum is set to, we can calculate approximately how much dead/empty space should exist in indexes, and make sure we're within reason (though this might mean having a mode that disables deleting known-dead tuples before splitting a page). Another possibility would be actually inspecting the indexes for invalid entries. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Jim Nasby <jnasby@pervasive.com> writes:
> Even if we stopped right there it would still be a huge win in many (most?)
> cases. How often do the indexes on a table comprise even 50% of the table's
> size?

I would say they're usually roughly comparable actually. It depends on how wide your table is of course, but the wider your table rows the more indexes you're likely to have on the table too.

> Even in the 50% case, you've gone from 1.5X to .6X

Sure, and a 3x speedup is nothing to sneeze at; that would be a great improvement to vacuum. But it's still just a linear speedup and doesn't address the algorithmic problem.

The fundamental problem is we have a process that's O(m) where m is the total space taken by a table and its indexes. The actual amount of space it has to reclaim is n. Other than n < m there's basically no relationship between these figures. As long as that's the case vacuum may as well be O(n^2) or O(n!).

We frequently assume -- and often it's a valid assumption -- that these figures are roughly proportional. Hence all the talk about databases reaching a "steady state" where the amount of dead space is constantly being reclaimed at more or less the same speed it's being generated. But there are also plenty of use cases where a complete vacuum pass takes thousands of times longer than the I/O it took to generate those dead tuples. Currently Postgres just isn't that great a tool for those use cases.

Unfortunately, while I'm convinced of the problem I'm equally unconvinced of the solution. I tried to solve online index builds using retail index lookups in a very similar way to what's being discussed here, and ran into the same problems. I eventually decided that while it could be made to work that way, it would be far too much code, far too unsafe, and far too invasive in the index access methods to be the right approach.

Our existing method works with minimal help from the index access methods, which allows for an enormous degree of freedom in the index design. To be able to support retail vacuum you would have to force index method implementors to keep information in a way that allowed them to look up a particular value/tid efficiently, which would drastically limit the kinds of indexes you could support.

--
greg
On Thu, Jul 27, 2006 at 05:24:35PM -0400, Greg Stark wrote:
> Jim Nasby <jnasby@pervasive.com> writes:
> > Even if we stopped right there it would still be a huge win in many
> > (most?) cases. How often do the indexes on a table comprise even 50% of
> > the table's size?
>
> I would say they're usually roughly comparable actually. It depends on how
> wide your table is of course but the wider your table rows the more indexes
> you're likely to have on the table too.

I think the number of fields in a table will correlate with the number of
indexes, but I don't think the width of those fields matters.

> > Even in the 50% case, you've gone from 1.5X to .6X
>
> Sure, and a 3x speedup is nothing to sneeze at, that would be a great
> improvement to vacuum. But it's still just a linear speedup and doesn't
> address the algorithmic problem.
>
> The fundamental problem is we have a process that's O(m) where m is the
> total space taken by a table and its indexes. The actual amount of space
> it has to reclaim is n. Other than n < m there's basically no relationship
> between these figures. As long as that's the case vacuum may as well be
> O(n^2) or O(n!).
>
> We frequently assume -- and often it's a valid assumption -- that these
> figures are roughly proportional. Hence all the talk about databases
> reaching a "steady-state" where the amount of dead space is constantly
> being reclaimed at more or less the same speed it's being generated. But
> there are also plenty of use cases where a complete vacuum pass takes
> thousands of times longer than the i/o it took to generate those dead
> tuples. Currently Postgres just isn't that great a tool for those use
> cases.

This is exactly why I'm suggesting that we stop waiting for the perfect
vacuum that will only hit the exact tuples in both the heap and indexes
that it needs to, and at least take the giant leap forward of only hitting
heap tuples/pages that we know are dead. Even if indexes are the same size
as the heap, you've still gained nearly a 2x speed improvement (depending
on how long you wait to vacuum).

> Unfortunately while I'm convinced of the problem I'm equally unconvinced
> of the solution. I tried to solve online index builds using retail index
> lookups in a very similar way to what's being discussed here. And ran
> into the same problems. I eventually decided that while it could be made
> to work that way it would be far too much code, far too unsafe, and far
> too invasive in the index access methods to be the right approach.
>
> Our existing method works with minimal help from the index access methods
> which allows for an enormous degree of freedom in the index design. To be
> able to support retail vacuum you would have to force index method
> implementors to keep information in a way that allowed them to look up a
> particular value/tid efficiently which would limit the kinds of indexes
> you could support drastically.
>
> --
> greg

-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com      work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
Have we made a decision on this issue? Should I apply my patch that still
forces a split unless 10% of the page has been freed?

---------------------------------------------------------------------------

Tom Lane wrote:
> ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> > This is a revised patch originated by Junji TERAMOTO for HEAD.
> > [BTree vacuum before page splitting]
> > http://archives.postgresql.org/pgsql-patches/2006-01/msg00301.php
> > I think we can resurrect his idea because we will scan btree pages
> > a page at a time now; the missing-restarting-point problem went away.
>
> I've applied this but I'm now having some second thoughts about it,
> because I'm seeing an actual *decrease* in pgbench numbers from the
> immediately prior CVS HEAD code. Using
>
>   pgbench -i -s 10 bench
>   pgbench -c 10 -t 1000 bench   (repeat this half a dozen times)
>
> with fsync off but all other settings factory-stock, what I'm seeing
> is that the first run looks really good but subsequent runs tail off in
> spectacular fashion :-( Pre-patch there was only minor degradation in
> successive runs.
>
> What I think is happening is that because pgbench depends so heavily on
> updating existing records, we get into a state where an index page is
> about full and there's one dead tuple on it, and then for each insertion
> we have
>
>   * check for uniqueness marks one more tuple dead (the
>     next-to-last version of the tuple)
>   * newly added code removes one tuple and does a write
>   * now there's enough room to insert one tuple
>   * lather, rinse, repeat, never splitting the page.
>
> The problem is that we've traded splitting a page every few hundred
> inserts for doing a PageIndexMultiDelete, and emitting an extra WAL
> record, on *every* insert. This is not good.
>
> Had you done any performance testing on this patch, and if so what
> tests did you use? I'm a bit hesitant to try to fix it on the basis
> of pgbench results alone.
>
> One possible fix that comes to mind is to only perform the cleanup
> if we are able to remove more than one dead tuple (perhaps about 10
> would be good). Or do the deletion anyway, but then go ahead and
> split the page unless X amount of space has been freed (where X is
> more than just barely enough for the incoming tuple).
>
> After all the thought we've put into this, it seems a shame to
> just abandon it :-(. But it definitely needs more tweaking.
>
> 			regards, tom lane

-- 
Bruce Momjian   bruce@momjian.us
EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
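The mitigation Bruce and Tom are discussing can be sketched as follows.
This is a simplified, hypothetical version of the decision in the insert
path; the struct and function names here are illustrative, not the actual
PostgreSQL code. The idea is to pay for the cleanup (the
PageIndexMultiDelete plus an extra WAL record) only when it reclaims a
meaningful fraction of the page, and otherwise fall through to a normal
split.

```c
#include <stdbool.h>
#include <stddef.h>

#define BLCKSZ 8192

/* Hypothetical page summary; the real code inspects btree page items. */
typedef struct {
    size_t free_space;   /* bytes currently free on the page */
    size_t dead_space;   /* bytes reclaimable by removing dead tuples */
    int    dead_count;   /* number of dead tuples on the page */
} PageInfo;

/* Sketch of the heuristic from the thread: clean the page instead of
 * splitting only when, after cleanup, the new tuple fits AND at least
 * BLCKSZ/10 is free.  Requiring the extra headroom avoids the
 * one-dead-tuple-per-insert pathology Tom describes, where cleanup
 * frees just barely enough room and so runs on every insert. */
static bool
cleanup_instead_of_split(const PageInfo *page, size_t itemsz)
{
    size_t after_cleanup = page->free_space + page->dead_space;

    if (page->dead_count == 0)
        return false;               /* nothing to reclaim: just split */
    return after_cleanup >= itemsz + BLCKSZ / 10;
}
```

Under this rule, the steady-state pgbench case (one dead tuple, barely
enough room for the incoming tuple) falls through to an ordinary split
rather than emitting a WAL record on every insert, while a page carrying
many dead tuples still gets cleaned in place.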
Bruce Momjian <bruce@momjian.us> writes:
> Have we made a decision on this issue? Should I apply my patch that
> still forces a split unless 10% of the page has been freed?

I haven't gotten back to doing any more performance testing. Please
stick that patch on the pending queue, so we don't forget it, but
don't apply it yet ...

			regards, tom lane
Tom, should I apply this patch now? Are you still considering other
options for this?

---------------------------------------------------------------------------

Bruce Momjian wrote:
> Tom, I ran your tests with fsync off (as you did), and saw numbers
> bouncing between 400-700 tps without my patch, and sticking at 700 tps
> with my patch.
>
> Bruce Momjian wrote:
> > The attached patch requires the new row to fit, and 10% to be free on
> > the page. Would someone test that?
> >
> > Tom Lane wrote:
> > > ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> > > > This is a revised patch originated by Junji TERAMOTO for HEAD.
> > > > [BTree vacuum before page splitting]
> > > > http://archives.postgresql.org/pgsql-patches/2006-01/msg00301.php
> > > > I think we can resurrect his idea because we will scan btree pages
> > > > a page at a time now; the missing-restarting-point problem went
> > > > away.
> > >
> > > I've applied this but I'm now having some second thoughts about it,
> > > because I'm seeing an actual *decrease* in pgbench numbers from the
> > > immediately prior CVS HEAD code. Using
> > >
> > >   pgbench -i -s 10 bench
> > >   pgbench -c 10 -t 1000 bench   (repeat this half a dozen times)
> > >
> > > with fsync off but all other settings factory-stock, what I'm seeing
> > > is that the first run looks really good but subsequent runs tail off
> > > in spectacular fashion :-( Pre-patch there was only minor degradation
> > > in successive runs.
> > >
> > > What I think is happening is that because pgbench depends so heavily
> > > on updating existing records, we get into a state where an index page
> > > is about full and there's one dead tuple on it, and then for each
> > > insertion we have
> > >
> > >   * check for uniqueness marks one more tuple dead (the
> > >     next-to-last version of the tuple)
> > >   * newly added code removes one tuple and does a write
> > >   * now there's enough room to insert one tuple
> > >   * lather, rinse, repeat, never splitting the page.
> > >
> > > The problem is that we've traded splitting a page every few hundred
> > > inserts for doing a PageIndexMultiDelete, and emitting an extra WAL
> > > record, on *every* insert. This is not good.
> > >
> > > Had you done any performance testing on this patch, and if so what
> > > tests did you use? I'm a bit hesitant to try to fix it on the basis
> > > of pgbench results alone.
> > >
> > > One possible fix that comes to mind is to only perform the cleanup
> > > if we are able to remove more than one dead tuple (perhaps about 10
> > > would be good). Or do the deletion anyway, but then go ahead and
> > > split the page unless X amount of space has been freed (where X is
> > > more than just barely enough for the incoming tuple).
> > >
> > > After all the thought we've put into this, it seems a shame to
> > > just abandon it :-(. But it definitely needs more tweaking.
> > >
> > > 			regards, tom lane

-- 
Bruce Momjian   bruce@momjian.us
EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> writes:
> Tom, should I apply this patch now? Are you still considering other
> options for this?

Please wait. This issue is very far down the to-do list in terms of size
or significance ... but I'll get to it.

			regards, tom lane
Bruce Momjian <bruce@momjian.us> writes:
> Tom Lane wrote:
>> I've applied this but I'm now having some second thoughts about it,
>> because I'm seeing an actual *decrease* in pgbench numbers from the
>> immediately prior CVS HEAD code.

> The attached patch requires the new row to fit, and 10% to be free on
> the page. Would someone test that?

At the moment, I cannot replicate any consistent difference between
CVS head with the patch, without the patch, with the patch plus your
BLCKSZ/10 limit addition, or with a variant BLCKSZ/32 limit addition.
That's whether I use HEAD's broken version of pgbench or one from late
July. So I'm feeling a tad frustrated ... but I have no evidence in
favor of changing what is in CVS, and accordingly recommend that we
leave well enough alone for 8.2.

			regards, tom lane
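For reference, the benchmark procedure used throughout this thread (fsync
off, all other settings factory-stock) amounts to the following; the loop
count is just the "half a dozen times" from Tom's description, and the
database name `bench` is taken from his commands:

```shell
# Initialize a scale-10 pgbench database, then run repeated
# 10-client, 1000-transaction runs and watch whether tps degrades
# across successive runs (the symptom reported with the patch applied).
pgbench -i -s 10 bench
for i in 1 2 3 4 5 6; do
    pgbench -c 10 -t 1000 bench
done
```

Note this requires a running server; the degradation, when it appeared,
showed up as a good first run followed by successively worse ones.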
OK, removed from open item list.

---------------------------------------------------------------------------

Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Tom Lane wrote:
> >> I've applied this but I'm now having some second thoughts about it,
> >> because I'm seeing an actual *decrease* in pgbench numbers from the
> >> immediately prior CVS HEAD code.
>
> > The attached patch requires the new row to fit, and 10% to be free on
> > the page. Would someone test that?
>
> At the moment, I cannot replicate any consistent difference between
> CVS head with the patch, without the patch, with the patch plus your
> BLCKSZ/10 limit addition, or with a variant BLCKSZ/32 limit addition.
> That's whether I use HEAD's broken version of pgbench or one from late
> July. So I'm feeling a tad frustrated ... but I have no evidence in
> favor of changing what is in CVS, and accordingly recommend that we
> leave well enough alone for 8.2.
>
> 			regards, tom lane

-- 
Bruce Momjian   bruce@momjian.us
EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +