Thread: Page at a time index scan
Here's a patch that implements page at a time index scans discussed at pgsql-hackers earlier. See proposal 1 at: http://archives.postgresql.org/pgsql-hackers/2006-03/msg01237.php

It passes regression tests, and there are no known bugs. There are some minor issues I'd like to point out, though:

1. An index scan now needs to allocate enough memory to hold potentially a whole page worth of items. And if you use markpos/restrpos, twice that much. I don't know if that's an issue, but I thought I'd bring that up.

2. Vacuum is now done in one phase, scanning the index in physical order. That significantly speeds up index vacuums of large indexes that don't fit into memory. However, btbulkdelete doesn't know if the vacuum is a full or lazy one. The patch just assumes it's a lazy vacuum, but the API really needs to be changed to pass that information earlier than at vacuum_cleanup.

3. Before the patch, a scan would keep the current page pinned to keep vacuum from deleting the current item. The patch doesn't change that behaviour, but it now seems to me that even a pin is no longer needed.

The patch needs testing and review, to ensure it doesn't break anything, and to see the effect on performance. It doesn't change disk layout or catalogs, so you can run it using the same data directory as with the unpatched version.

- Heikki
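To make the memory question in point 1 concrete, the per-scan state could look roughly like the sketch below. The names and the 1024-item bound are illustrative, not the patch's actual definitions.

#include "postgres.h"
#include "storage/block.h"
#include "storage/itemptr.h"

/* worst-case guess at how many items a single leaf page could hold */
#define MAX_ITEMS_PER_LEAF_PAGE		1024

/* heap TIDs copied out of one leaf page into backend-local memory */
typedef struct ScanPagePos
{
	BlockNumber	currPage;		/* leaf page the items were read from */
	int			firstItem;		/* first valid entry in items[] */
	int			lastItem;		/* last valid entry in items[] */
	int			itemIndex;		/* next item to hand back to the caller */
	ItemPointerData items[MAX_ITEMS_PER_LEAF_PAGE];
} ScanPagePos;

/* per-scan state: markpos/restrpos needs a second full copy */
typedef struct ScanState
{
	ScanPagePos	currPos;
	ScanPagePos	markPos;
} ScanState;

With the default 8 kB block size each copy is roughly 6 kB, so keeping the markpos copy about doubles the per-scan footprint.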
On Mon, 2006-05-01 at 22:00 +0300, Heikki Linnakangas wrote:
> Here's a patch that implements page at a time index scans discussed at
> pgsql-hackers earlier. See proposal 1 at:
> http://archives.postgresql.org/pgsql-hackers/2006-03/msg01237.php
>
> It passes regression tests, and there are no known bugs. There are
> some minor issues I'd like to point out, though:
>
> 1. An index scan now needs to allocate enough memory to hold potentially a
> whole page worth of items. And if you use markpos/restrpos, twice that
> much. I don't know if that's an issue, but I thought I'd bring that up.

AFAICS the code:

- allocates memory for the markpos whether or not it's ever needed? Most index scans never call markpos and not all merge joins either, so that seems wasteful. We could allocate when btmarkpos() is called for the first time, if ever.

- allocates 1024 offsets in every case. If this were just a unique index retrieval, that would be too much. When scan->is_multiscan == true, go straight for 1024, otherwise start with just 32 offsets and double that when/if required.

> 2. Vacuum is now done in one phase, scanning the index in physical order.
> That significantly speeds up index vacuums of large indexes that don't fit
> into memory.

Also for those that *aren't in* memory. Should speed up medium-sized VACUUMs also, because of sequential disk access.

> However, btbulkdelete doesn't know if the vacuum is a full or
> lazy one. The patch just assumes it's a lazy vacuum, but the API really
> needs to be changed to pass that information earlier than at vacuum_cleanup.

Looks like it needs work. Do you have suggestions while you're there?

> 3. Before the patch, a scan would keep the current page pinned to keep
> vacuum from deleting the current item. The patch doesn't change that
> behaviour, but it now seems to me that even a pin is no longer needed.

Agreed. The pin has two functions:
- keep the page from being moved out of the bufmgr - no need anymore
- stop a vacuum from removing the page - no need anymore. We'll not stop on a removable row anymore, so no need.

Some of the code doesn't use standard spacing, e.g. "if(" should be "if (", but other than that it looks very neat and well implemented.

Overall, I'm optimistic that this patch will help in a number of ways. Speeding up a VACUUM index scan is a primary objective and it looks like that will work well. Also, this looks like it will reduce LWLocking overhead and encourage sequential memory scans of blocks, both of which will improve index scan performance. It should also reduce buffer residency time, making shared_buffers more fluid. So, subject to performance tests, I'm very interested in this.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
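A sketch of those two suggestions, with a hypothetical struct and helpers (the field and function names are made up; only the allocation pattern is the point):

#include "postgres.h"
#include "storage/off.h"

typedef struct ScanOffsets
{
	bool		is_multiscan;	/* bulk (amgetmulti-style) fetch? */
	OffsetNumber *offsets;		/* growable array of remembered offsets */
	int			numOffsets;
	int			maxOffsets;
	OffsetNumber *markOffsets;	/* copy for markpos, allocated lazily */
	int			markNumOffsets;
} ScanOffsets;

static void
remember_offset(ScanOffsets *so, OffsetNumber off)
{
	if (so->offsets == NULL)
	{
		/* bulk scans go straight for the worst case, others start small */
		so->maxOffsets = so->is_multiscan ? 1024 : 32;
		so->offsets = (OffsetNumber *) palloc(so->maxOffsets * sizeof(OffsetNumber));
		so->numOffsets = 0;
	}
	else if (so->numOffsets >= so->maxOffsets)
	{
		/* double the array when it fills up */
		so->maxOffsets *= 2;
		so->offsets = (OffsetNumber *) repalloc(so->offsets,
												so->maxOffsets * sizeof(OffsetNumber));
	}
	so->offsets[so->numOffsets++] = off;
}

static void
mark_position(ScanOffsets *so)
{
	/* pay for the second copy only if btmarkpos() is ever called */
	if (so->markOffsets != NULL)
		pfree(so->markOffsets);
	so->markOffsets = (OffsetNumber *) palloc(so->numOffsets * sizeof(OffsetNumber));
	memcpy(so->markOffsets, so->offsets, so->numOffsets * sizeof(OffsetNumber));
	so->markNumOffsets = so->numOffsets;
}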
On Tue, 2 May 2006, Simon Riggs wrote:
> On Mon, 2006-05-01 at 22:00 +0300, Heikki Linnakangas wrote:
>>
>> 1. An index scan now needs to allocate enough memory to hold potentially a
>> whole page worth of items. And if you use markpos/restrpos, twice that
>> much. I don't know if that's an issue, but I thought I'd bring that up.
>
> AFAICS the code:
>
> - allocates memory for the markpos whether or not it's ever needed? Most
> index scans never call markpos and not all merge joins either, so that
> seems wasteful. We could allocate when btmarkpos() is called for the
> first time, if ever.

Right. I'll do that.

> - allocates 1024 offsets in every case. If this were just a unique index
> retrieval, that would be too much. When scan->is_multiscan == true, go
> straight for 1024, otherwise start with just 32 offsets and double that
> when/if required.

I wonder if that gets a bit too complicated for saving a small amount of memory? I'll do it if people think it's worth it.

Also, could we calculate a better estimate of the maximum number of offsets an index page can hold? There's no way a page can really hold 1024 items. Page headers and special space, line pointers, and the actual keys need some space.

>> However, btbulkdelete doesn't know if the vacuum is a full or
>> lazy one. The patch just assumes it's a lazy vacuum, but the API really
>> needs to be changed to pass that information earlier than at vacuum_cleanup.
>
> Looks like it needs work. Do you have suggestions while you're there?

Now that I look at it: Why do we have a separate vacuum_cleanup function at all? Calls to index_vacuum_cleanup go hand in hand with calls to index_bulk_delete. I thought that index_vacuum_cleanup would only be called after the last cycle of a multi-cycle vacuum, but that doesn't seem to be the case.

>> 3. Before the patch, a scan would keep the current page pinned to keep
>> vacuum from deleting the current item. The patch doesn't change that
>> behaviour, but it now seems to me that even a pin is no longer needed.
>
> Agreed. The pin has two functions:
> - keep the page from being moved out of the bufmgr - no need anymore
> - stop a vacuum from removing the page - no need anymore. We'll not stop
> on a removable row anymore, so no need.

At the moment, backward scan returns to the page to walk left from there. It could be changed to take a copy of the left sibling pointer like forward scans do, eliminating the need to return to the original page in most cases. Also, if there are dead tuples on the page, the scan will need to return to the page to kill them. But you can always re-pin the page when needed.

> Some of the code doesn't use standard spacing e.g. "if(" should be "if
> (", but other than that it looks very neat and well implemented.

Thanks, I'll fix the spacing.

- Heikki
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> Also, could we calculate a better estimate of the maximum number of
> offsets an index page can hold?

We could make something analogous to MaxHeapTuplesPerPage --- the correct number ought to be approximately BLCKSZ/16, I should think. (It's not possible for an entry to be *just* the header, there has to be either a datum or a null bitmap. Hence, with maxalign padding, at least 12 bytes per item, plus 4 for the item pointer.)

> Now that I look at it: Why do we have a separate vacuum_cleanup function
> at all? Calls to index_vacuum_cleanup go hand in hand with calls to
> index_bulk_delete.

Yeah, I was just thinking we ought to revisit that. The original idea was that vacuumcleanup would be called just once at the end of vacuuming, not once per bulkdelete. I don't recall why I changed it but it was probably a bad idea to do so.

>> Agreed. The pin has two functions:
>> - keep the page from being moved out of the bufmgr - no need anymore
>> - stop a vacuum from removing the page - no need anymore. We'll not stop
>> on a removable row anymore, so no need.

> At the moment, backward scan returns to the page to walk left from there.

Backwards scan may break this whole concept; are you sure you've thought it through?

regards, tom lane
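For illustration, a macro along those lines could look like this (the name is made up, and it ignores the btree special space, so it slightly overestimates):

#include "postgres.h"
#include "access/itup.h"
#include "storage/bufpage.h"
#include "storage/itemid.h"

/*
 * Upper bound on index entries per page: each entry needs a maxaligned
 * index tuple (header plus at least one byte of datum or null bitmap)
 * and a line pointer.  That works out to roughly BLCKSZ/16, i.e. on the
 * order of 400-500 entries per 8 kB page rather than 1024, depending on
 * the platform's maxalign.
 */
#define MaxIndexItemsPerPage \
	((int) ((BLCKSZ - SizeOfPageHeaderData) / \
			(MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))))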
I wrote: > Heikki Linnakangas <hlinnaka@iki.fi> writes: >> Now that I look at it: Why do we have a separate vacuum_cleanup function >> at all? Calls to index_vacuum_cleanup go hand in hand with calls to >> index_bulk_delete. > Yeah, I was just thinking we ought to revisit that. The original idea > was that vacuumcleanup would be called just once at the end of vacuuming, > not once per bulkdelete. I don't recall why I changed it but it was > probably a bad idea to do so. I remember why: the design involves passing a palloc'd struct from bulkdelete to vacuumcleanup and there needed to be matching calls to make that behave sanely. We could fix this if the API is something like "the first bulkdelete call palloc's the struct, and *it's passed in to* each subsequent bulkdelete, as well as being passed to vacuumcleanup". So we're short one argument to bulkdelete. This would provide a saner way of dealing with the case of nothing to delete, too: if vacuumcleanup gets a NULL stats pointer, that means bulkdelete wasn't ever called (or was, but chose never to return a non-null pointer). Also, as noted in other contexts, it'd be a good idea if vacuumcleanup was told the total number of heap tuples (GIN needs this), and both steps really ought to be able to find out if it's a full or lazy vacuum. I'll work on making these API changes; the recent GIN patch has left some other detritus that needs to be cleaned up in the same area. regards, tom lane
On Tue, 2 May 2006, Tom Lane wrote:
> Also, as noted in other contexts, it'd be a good idea if vacuumcleanup
> was told the total number of heap tuples (GIN needs this), and both
> steps really ought to be able to find out if it's a full or lazy vacuum.

It's already in IndexVacuumCleanupInfo, isn't it?

/* Struct for additional arguments passed to vacuum-cleanup operation */
typedef struct IndexVacuumCleanupInfo
{
	bool		vacuum_full;	/* VACUUM FULL (we have exclusive lock) */
	int			message_level;	/* ereport level for progress messages */
-->	double		num_heap_tuples;	/* tuples remaining in heap */
} IndexVacuumCleanupInfo;

gistvacuumcleanup uses num_heap_tuples to set num_index_tuples when it doesn't need to scan the index otherwise.

BTW: Is it possible to have a partial gist index? If it is, num_index_tuples = num_heap_tuples isn't right.

- Heikki
On Tue, 2 May 2006, Tom Lane wrote: >>> Agreed. The pin has two functions: >>> - keep the page from being moved out of the bufmgr - no need anymore >>> - stop a vacuum from removing the page - no need anymore. We'll not stop >>> on a removable row anymore, so no need. > >> At the moment, backward scan returns to the page to walk left from there. > > Backwards scan may break this whole concept; are you sure you've thought > it through? I think so. The patch doesn't change the walk-left code. Do you have something specific in mind? - Heikki
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On Tue, 2 May 2006, Tom Lane wrote:
>> Also, as noted in other contexts, it'd be a good idea if vacuumcleanup
>> was told the total number of heap tuples (GIN needs this), and both
>> steps really ought to be able to find out if it's a full or lazy vacuum.

> It's already in IndexVacuumCleanupInfo, isn't it?

Yeah, I had forgotten that, but just noticed it again now. The patch I'm working on at the moment defines

/*
 * Struct for input arguments passed to ambulkdelete and amvacuumcleanup
 *
 * Note that num_heap_tuples will not be valid during ambulkdelete,
 * only amvacuumcleanup.
 */
typedef struct IndexVacuumInfo
{
	Relation	index;			/* the index being vacuumed */
	bool		vacuum_full;	/* VACUUM FULL (we have exclusive lock) */
	int			message_level;	/* ereport level for progress messages */
	double		num_heap_tuples;	/* tuples remaining in heap */
} IndexVacuumInfo;

with

IndexBulkDeleteResult *
ambulkdelete (IndexVacuumInfo *info,
              IndexBulkDeleteResult *stats,
              IndexBulkDeleteCallback callback,
              void *callback_state);

Because of limited <varname>maintenance_work_mem</>, <function>ambulkdelete</> may need to be called more than once when many tuples are to be deleted. The <literal>stats</> argument is the result of the previous call for this index (it is NULL for the first call within a <command>VACUUM</> operation). This allows the AM to accumulate statistics across the whole operation. Typically, <function>ambulkdelete</> will modify and return the same struct if the passed <literal>stats</> is not null.

IndexBulkDeleteResult *
amvacuumcleanup (IndexVacuumInfo *info,
                 IndexBulkDeleteResult *stats);

Clean up after a <command>VACUUM</command> operation (zero or more <function>ambulkdelete</> calls). This does not have to do anything beyond returning index statistics, but it may perform bulk cleanup such as reclaiming empty index pages. <literal>stats</> is whatever the last <function>ambulkdelete</> call returned, or NULL if <function>ambulkdelete</> was not called because no tuples needed to be deleted. If the result is not NULL it must be a palloc'd struct. The statistics it contains will be reported by <command>VACUUM</> if <literal>VERBOSE</> is given.

> BTW: Is it possible to have a partial gist index? If it is,
> num_index_tuples = num_heap_tuples isn't right.

It is, and it isn't ;-). We'll need to see about fixing that.

regards, tom lane
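For illustration, the vacuum side would then chain the stats struct through the calls roughly like this (the helper and loop structure are hypothetical; index_bulk_delete and index_vacuum_cleanup are assumed to be thin wrappers with the same shape as the AM entry points above):

#include "postgres.h"
#include "access/genam.h"

/*
 * Hypothetical caller: one ambulkdelete call per batch of dead heap TIDs
 * collected under maintenance_work_mem, then a single amvacuumcleanup.
 * The same stats struct (NULL on the first call) is passed back in each
 * time so the AM can accumulate totals across the whole VACUUM.
 */
static IndexBulkDeleteResult *
vacuum_one_index(IndexVacuumInfo *info, int nbatches,
				 IndexBulkDeleteCallback callback, void *callback_state)
{
	IndexBulkDeleteResult *stats = NULL;
	int			i;

	for (i = 0; i < nbatches; i++)
		stats = index_bulk_delete(info, stats, callback, callback_state);

	/* stats is still NULL here if no batch ever had anything to delete */
	return index_vacuum_cleanup(info, stats);
}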
Heikki Linnakangas <hlinnaka@iki.fi> writes: > On Tue, 2 May 2006, Tom Lane wrote: >> Backwards scan may break this whole concept; are you sure you've thought >> it through? > I think so. The patch doesn't change the walk-left code. Do you have > something specific in mind? I'm worried about synchronization, particularly what happens if the page gets deleted from under you while you don't have it pinned. regards, tom lane
On Tue, 2 May 2006, Tom Lane wrote: > Heikki Linnakangas <hlinnaka@iki.fi> writes: >> On Tue, 2 May 2006, Tom Lane wrote: >>> Backwards scan may break this whole concept; are you sure you've thought >>> it through? > >> I think so. The patch doesn't change the walk-left code. Do you have >> something specific in mind? > > I'm worried about synchronization, particularly what happens if the page > gets deleted from under you while you don't have it pinned. AFAICS, shouldn't happen. The half-dead system makes sure that a page won't get deleted while a scan might still be interested in it. It doesn't depend on pins. - Heikki
On Tue, 2006-05-02 at 15:35 -0400, Tom Lane wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
> > On Tue, 2 May 2006, Tom Lane wrote:
> >> Backwards scan may break this whole concept; are you sure you've thought
> >> it through?
>
> > I think so. The patch doesn't change the walk-left code. Do you have
> > something specific in mind?
>
> I'm worried about synchronization, particularly what happens if the page
> gets deleted from under you while you don't have it pinned.

Perhaps I should update my comments on "we don't need a pin at all"...

On a forward scan we need to pin while we are reading a whole page, though we can release the pin afterwards. We don't need to keep the pin while servicing btgetnext() requests from our private page buffer, though. (Which is what I meant to say.)

AFAICS we will need to return to the page for a backward scan, so we could just keep the pin the whole way. It's not possible to cache the left page pointer because block splits to our immediate left can update them even after we read the page contents. (A forward scan need never fear page splits in the same way because existing items can't move past the existing page boundary.)

We need never return to a page that *could* be deleted. While scanning in either direction, if the complete page contains nothing but dead items we can simply move straight onto the next page, having updated the page status to half-dead. (The great thing about this patch is that we should be able to report that somehow, so an asynchronous task handler can come and clean that page (only), now that we don't have a restriction on individual page vacuuming. We can think about how to do that later.)

If only some of the index tuples are deleted, we should only return to the page to update the deleted index tuples *if*:

- the page is still in the buffer pool. If it's been evicted, it's because space is tight, so we shouldn't call it back just to dirty the page.
- we have a minimum threshold of deleted tuples. Otherwise we might re-dirty the page for just a single hint bit, so we end up writing the page out hundreds of times. (Guess: that should be 2 or 3)

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
Simon Riggs <simon@2ndquadrant.com> writes: > AFAICS we will need to return to the page for a backward scan, so we > could just keep the pin the whole way. It's not possible to cache the > left page pointer because block splits to our immediate left can update > them even after we read the page contents. Sure we can cache the left pointer: it's not any different from the normal walk-left problem, because as soon as you drop pin on the page you are leaving, all the same hazards apply. This algorithm just has a slightly wider window between dropping one pin and acquiring the next. Walk-left is painful and (rarely) expensive, but it's a solved problem. > We need never return to a page that *could* be deleted. While scanning > in either direction, if the complete page contains nothing but dead > items we can simply move straight onto the next page, having updated the > page status to half-dead. This is unnecessary and probably wrong. > - we have a minimum threshold of deleted tuples. Otherwise we might > re-dirty the page for just a single hint bit, so we end up writing the > page out hundreds of times. (Guess: that should be 2 or 3) You are optimizing the wrong thing here. If we choose not to mark an entry dead then we will pay for that omission on every future scan of the same entry. I don't think that outweighs the (doubtless rare) situation where we expend an extra page fetch to reload the page. It's worth noting that all of this stuff is predicated on the assumption that index items never move across pre-existing page boundaries, in either direction. We are therefore going to be permanently giving up any prospect of index space reclamation by merging partly-filled pages (unless maybe in VACUUM FULL). We didn't know how to do that anyway, so I don't feel too bad about it, but if indexscans don't make any attempt to explicitly re-locate their positions then that certainly goes out the window. regards, tom lane
On Tue, 2006-05-02 at 15:35 -0400, Tom Lane wrote:
> > > I'm worried about synchronization, particularly what happens if the page
> > > gets deleted from under you while you don't have it pinned.

On Wed, 2006-05-03 at 10:17 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > We need never return to a page that *could* be deleted. While scanning
> > in either direction, if the complete page contains nothing but dead
> > items we can simply move straight onto the next page, having updated the
> > page status to half-dead.
>
> This is unnecessary and probably wrong.

You'll need to be more specific about what you mean. Heikki's concurrent post says roughly the same thing as what I just said, AFAICS. Do you see a problem with page deletion? If so, where?

> It's worth noting that all of this stuff is predicated on the assumption
> that index items never move across pre-existing page boundaries, in
> either direction. We are therefore going to be permanently giving up
> any prospect of index space reclamation by merging partly-filled pages
> (unless maybe in VACUUM FULL). We didn't know how to do that anyway,
> so I don't feel too bad about it, but if indexscans don't make any
> attempt to explicitly re-locate their positions then that certainly
> goes out the window.

Seems like a step forwards to me, even if there is still a wish to go further; we've all been trying to improve this behaviour for some time, so hats off to Heikki...

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
On Wed, 2006-05-03 at 10:17 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> You are optimizing the wrong thing here. If we choose not to mark an
> entry dead then we will pay for that omission on every future scan of
> the same entry. I don't think that outweighs the (doubtless rare)
> situation where we expend an extra page fetch to reload the page.

Sounds like a familiar conversation, which I shouldn't have raised here. This depends upon whether the pages being accessed are in cache or not, and whether we have sufficient I/O to pay the cost of a write. Reads don't always go to disk; writes always do. I see that it's difficult to tell which is which, but that doesn't mean there aren't different cases.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
Simon Riggs <simon@2ndquadrant.com> writes: > On Tue, 2006-05-02 at 15:35 -0400, Tom Lane wrote: >> This is unnecessary and probably wrong. > You'll need to be more specific about what you mean. There is no point in marking a page half-dead, as that doesn't save anyone else from visiting it, and it's probably wrong to mark a leaf page as half-dead at all. That state is associated with upper pages. Even if it were a legal tree configuration, marking the page half-dead would make it impossible to insert any more keys in the page, which doesn't strike me as an appropriate behavior; it's likely to force excess I/O later due to unnecessary page splits during future inserts. regards, tom lane
On Wed, 2006-05-03 at 10:56 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Tue, 2006-05-02 at 15:35 -0400, Tom Lane wrote:
> >> This is unnecessary and probably wrong.
>
> > You'll need to be more specific about what you mean.
>
> There is no point in marking a page half-dead, as that doesn't save
> anyone else from visiting it, and it's probably wrong to mark a leaf
> page as half-dead at all. That state is associated with upper pages.
>
> Even if it were a legal tree configuration, marking the page half-dead
> would make it impossible to insert any more keys in the page, which
> doesn't strike me as an appropriate behavior; it's likely to force
> excess I/O later due to unnecessary page splits during future inserts.

OK. So do you see a problem scenario like this? A, B and C separate backends:

A1 Reads page, some row versions are *not* marked LP_DELETE but will be later when A2 happens
B1 VACUUM removes dead rows, just happens to be all of them
B2 Recycles page into FSM
C1 Inserts new data into old page
A2 Attempts to update old page to notify about dead rows (UGH!)

B2 can only be stopped by pinning...

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
Simon Riggs <simon@2ndquadrant.com> writes: > So do you see a problem scenario like this? > A, B and C separate backends: > A1 Reads page, some row versions are *not* marked LP_DELETE but will be > later when A2 happens > B1 VACUUM removes dead rows, just happens to be all of them > B2 Recycles page into FSM > C1 Inserts new data into old page > A2 Attempts to update old page to notify about dead rows (UGH!) Can't happen; a page cannot be recycled until all concurrent transactions are gone. In any case, the LP_DELETE marking code will certainly take care to check that the entries it's trying to mark are still the same ones it meant to mark. regards, tom lane
On Wed, 2006-05-03 at 13:39 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > So do you see a problem scenario like this? > > > A, B and C separate backends: > > A1 Reads page, some row versions are *not* marked LP_DELETE but will be > > later when A2 happens > > B1 VACUUM removes dead rows, just happens to be all of them > > B2 Recycles page into FSM > > C1 Inserts new data into old page > > A2 Attempts to update old page to notify about dead rows (UGH!) > > Can't happen; a page cannot be recycled until all concurrent > transactions are gone. In any case, the LP_DELETE marking code will > certainly take care to check that the entries it's trying to mark > are still the same ones it meant to mark. ! So do you see a problem with page deletion, or not? If so, what is it? This patch looks good to me, based upon everything said. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
On Wed, 3 May 2006, Heikki Linnakangas wrote: > On Tue, 2 May 2006, Tom Lane wrote: > >> Heikki Linnakangas <hlinnaka@iki.fi> writes: >>> On Tue, 2 May 2006, Tom Lane wrote: >>>> Backwards scan may break this whole concept; are you sure you've thought >>>> it through? >> >>> I think so. The patch doesn't change the walk-left code. Do you have >>> something specific in mind? >> >> I'm worried about synchronization, particularly what happens if the page >> gets deleted from under you while you don't have it pinned. > > AFAICS, shouldn't happen. The half-dead system makes sure that a page won't > get deleted while a scan might still be interested in it. It doesn't depend > on pins. As Tom pointed out elsewhere in this thread, the above explanation is wrong because half-dead state only applies to upper-level pages. I thought that half-dead means that the page is dead and removed from the tree, but not yet recycled because some transaction might still be interested in it. Now I see that that state is actually called "deleted". The point remains, however. A page won't get deleted while a scan might still be interested in it, because deleted pages are not immediately recycled (except on vacuum full), and the left and right sibling pointers stay intact until no transaction can be interested in it. - Heikki
Heikki Linnakangas <hlinnaka@iki.fi> writes: > The point remains, however. A page won't get deleted while a scan > might still be interested in it, because deleted pages are not > immediately recycled (except on vacuum full), and the left and right > sibling pointers stay intact until no transaction can be interested in it. Right. AFAICS there isn't any assumption there that isn't already made by indexscans, since we drop pin before moving to the adjacent page anyway. You still have to be prepared to deal with the same situations. (The new assumption is that index items won't be moved onto a pre-existing right sibling page; the old scan logic didn't assume that.) Heikki, were you planning to make a round of revisions in the patch, or is this as far as you wanted to take it? regards, tom lane
On Wed, 3 May 2006, Tom Lane wrote:
> Heikki, were you planning to make a round of revisions in the patch,
> or is this as far as you wanted to take it?

Here's an updated patch. It's the same as the original, but merged with the changes to the vacuum_cleanup API you committed, and some comment and space fixes here and there.

To-do:
* release pin earlier after reading the page, as discussed
* allocate memory for markpos/restrpos only when needed
* calculate a more precise estimate for the maximum number of items on an index page

Not big things, but I don't have the time to work on them at the moment.

- Heikki
Heikki Linnakangas <hlinnaka@iki.fi> writes: > Here's an updated patch. Uh ... no patch actually attached? > To-do: > Not big things, but I don't have the time to work on them at the moment. I can take it from here if you'll send me what you have. regards, tom lane
On Thu, 4 May 2006, Tom Lane wrote: > Heikki Linnakangas <hlinnaka@iki.fi> writes: >> Here's an updated patch. > > Uh ... no patch actually attached? Doh. Here you go. >> To-do: >> Not big things, but I don't have the time to work on them at the moment. > > I can take it from here if you'll send me what you have. Thanks! - Heikki
I've just realized that the locking considerations in this patch are considerably more complicated than we thought. The discussion so far considered only half of what the deletion interlock needs to accomplish. We have to ensure that indexscans don't lose their place, which is solved in the patch by stopping between pages rather than on a specific item. But we also have to ensure that indexscans don't return pointers to heap tuples that have already been deleted by VACUUM --- at least in the case where the caller is using a non-MVCC snapshot. Else, if the tuple is replaced by a newer tuple before the caller is able to visit it, the caller might mistakenly accept that tuple as the one it wanted.

The problematic scenario looks like this:

1. indexscan slurps a bunch of heap TIDs into local memory. It keeps pin, but not lock, on the index page it got the TIDs from.
2. someone else splits the index page, which is allowed since the page is only pinned not locked. Now, some of the index items are on a new page that the indexscan does not have pinned.
3. vacuum comes along and removes some of the moved items. Since they are on a page that no one has pinned, even the super-exclusive lock that vacuum takes won't prevent this.
4. vacuum removes the corresponding heap tuples.
5. someone else installs new, unrelated, tuples into the freed heap slots.
6. indexscan finally visits the heap. It finds tuples that are valid per its snapshot, but aren't what it thought it was finding (they don't necessarily match the index key).

While we could prevent this by rechecking whether fetched heap tuples match the index search condition, that costs an awful lot of cycles for such an obviously rare problem. We need a locking-based solution instead. I believe an appropriate fix looks like this:

1. Indexscan is required to keep pin on the page it read the items from (even though we realize they may not still be on that page). The pin can only be dropped when we are no longer going to return any more items read from the page.
2. btbulkdelete is required to get super-exclusive lock on *every leaf page in the index*. It must not try to optimize this down to just the pages containing deletable items.

With these rules, even though vacuum might have physically deleted the index item that the indexscan is reporting to its caller, the btbulkdelete call will not have been able to complete. (If bulkdelete arrived at the original page before the indexscan did, then of course it'd already have deleted the item and there's no problem. If it arrives after, then it'll be blocked by not being able to get super-exclusive lock.) Hence vacuum will not have removed the heap tuple yet, even though the index item might be gone.

We could improve concurrency if we extend the API so that btgettuple knows whether it is fetching for an MVCC-safe snapshot or not. When the scan is using an MVCC-safe snapshot it's OK to release pin immediately. However I'm not sure if this is worth the trouble --- it probably makes sense to hold onto the pin anyway, in case we're told to mark some of the tuples killed.

The patch as given gets point 2 right, but I think it doesn't consistently do point 1, and in any case it's certainly not documenting the issue properly.

BTW, I just realized another bug in the patch: btbulkdelete fails to guarantee that it visits every page in the index. It was OK for btvacuumcleanup to ignore pages added to the index after it starts, but btbulkdelete has to deal with such pages.
One thing I'm kind of wondering is whether amgetmulti should still exist as a separate API. Seems like if amgettuple is doing the equivalent thing under-the-hood, then amgetmulti isn't saving us anything except a few trips through FunctionCallN. It's not at all clear that it's worth the API complexity of having two different fetch functions. regards, tom lane
I wrote:
> BTW, I just realized another bug in the patch: btbulkdelete fails to
> guarantee that it visits every page in the index. It was OK for
> btvacuumcleanup to ignore pages added to the index after it starts,
> but btbulkdelete has to deal with such pages.

Actually, as written this patch does not work. At all. btbulkdelete has to guarantee that it removes every index entry it's told to, and it cannot guarantee that in the presence of concurrent page splits. A split could move index items from a page that btbulkdelete hasn't visited to one it's already passed over. This is not possible with an index-order traversal (because splits only move items to the right) but it's definitely possible with a physical-order traversal.

I was toying with the idea of remembering deletable pages (which btvacuumcleanup does anyway), which are the only ones that page splits could move items to, and then rescanning those after the completion of the primary pass. This has a couple of pretty unpleasant consequences though:

* We have to remember *every* deletable page for correctness, compared to the current situation where btvacuumcleanup can bound the number of pages it tracks. This creates a situation where VACUUM may fail outright if it doesn't have gobs of memory. Since one of the main reasons for developing lazy VACUUM was to ensure we could vacuum arbitrarily large tables in bounded memory, I'm not happy with this.

* The rescan could be far from cheap if there are many such pages. Also, I'm not sure when you can stop: it certainly seems possible that items could be moved down during the primary scan and then moved down again while btbulkdelete is reconsidering the deletable pages.

Without locking out splits entirely, it's far from obvious that we can make it work. Thoughts?

regards, tom lane
On Fri, 5 May 2006, Tom Lane wrote:
> I wrote:
>> BTW, I just realized another bug in the patch: btbulkdelete fails to
>> guarantee that it visits every page in the index. It was OK for
>> btvacuumcleanup to ignore pages added to the index after it starts,
>> but btbulkdelete has to deal with such pages.
>
> Actually, as written this patch does not work. At all. btbulkdelete
> has to guarantee that it removes every index entry it's told to, and
> it cannot guarantee that in the presence of concurrent page splits.
> A split could move index items from a page that btbulkdelete hasn't
> visited to one it's already passed over. This is not possible with an
> index-order traversal (because splits only move items to the right)
> but it's definitely possible with a physical-order traversal.

True. :(

The first solution that occurs to me is to force page splits to choose the target page so that its blkno > the original page's blkno during vacuum. It would cause the index to become more fragmented more quickly, which is bad but perhaps tolerable.

> I was toying with the idea of remembering deletable pages (which
> btvacuumcleanup does anyway), which are the only ones that page splits
> could move items to, and then rescanning those after the completion
> of the primary pass. This has a couple of pretty unpleasant
> consequences though:
> * We have to remember *every* deletable page for correctness, compared
> to the current situation where btvacuumcleanup can bound the number of
> pages it tracks. This creates a situation where VACUUM may fail
> outright if it doesn't have gobs of memory. Since one of the main
> reasons for developing lazy VACUUM was to ensure we could vacuum
> arbitrarily large tables in bounded memory, I'm not happy with this.
> * The rescan could be far from cheap if there are many such pages.

Yep, that's not good.

- Heikki
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> The first solution that occurs to me is to force page splits to choose the
> target page so that its blkno > the original page's blkno during vacuum.

I thought about that too, but don't like it for three reasons:

* it encourages index bloat, the more the longer the vacuum runs. Not good, especially if you've got aggressive vacuum cost delay settings.
* there's a locking problem with respect to how you turn that behavior on.
* there's a failure mode where the behavior doesn't get turned off if vacuum fails partway through.

regards, tom lane
Heikki Linnakangas <hlinnaka@iki.fi> writes: > On Fri, 5 May 2006, Tom Lane wrote: >>> BTW, I just realized another bug in the patch: btbulkdelete fails to >>> guarantee that it visits every page in the index. > The first solution that occurs to me is to force page splits to choose the > target page so that it's blkno > the original page's blkno during vacuum. > It would cause the index to become more fragmented more quickly, which is > bad but perhaps tolerable. I have a sketch of a solution that doesn't require any change in page allocation behavior. Can anyone see any holes in this: Assume that we have some way to recognize whether a page has been split since the current btbulkdelete scan started. (A split would mark both the original page and the new right-hand page as newly split.) When btbulkdelete arrives at a page, it need take no special action unless the page is newly split *and* its right-link points to a lower physical address. If that's true, then after vacuuming the page, follow its right-link and vacuum that page; repeat until arriving at a page that is either not newly split or is above the current location of the outer loop. Then return to the outer, sequential-scan loop. We should also have btbulkdelete clear the newly-split marker whenever it processes a page; this ensures that no page is processed more than once, which is probably worth the possible extra I/O needed to clear the marker. (The cycles to re-scan a page are maybe not that important, but if we do reprocess pages we'll end up with a bogusly high tuple count. OTOH I'm not sure we can guarantee an accurate tuple count anyway, so maybe it doesn't matter.) AFAICS, this works correctly even if the test for a newly-split page sometimes yields false positives; thinking a page is newly split when it is not might cost a little extra I/O but won't lead to wrong results. This reduces the requirements for the newly-split marking mechanism. So, how do we do that marking? Noting that we have 16 bits we could use in BTPageOpaqueData without making that struct any bigger (because of alignment considerations), I'm thinking about a 16-bit counter for each index that is incremented at the start of each btbulkdelete operation and copied into the opaque data whenever a page is split. If the value's different from the current counter, the page definitely hasn't been split during btbulkdelete. There's a 1-in-65536 chance of a false positive, if the last split occurred some exact multiple of 65536 vacuums ago, but that's probably low enough to be acceptable. (We could reduce the odds of false positives in various ways, eg by saying that zero isn't a valid counter value and having btbulkdelete reset a page's counter to zero anytime it has to write the page anyway.) This still has the same sort of locking issues I complained of in regards to Heikki's idea, namely how do you make sure that everyone is using the new counter value before you start scanning? That seems soluble however. We'd just need to be willing to take one more lock during every page split operation, which does not seem an intolerable amount of overhead. Then we do something like take a sharelock while fetching the counter value during split (and hold the lock until the split's complete), and take a momentary exclusive lock while advancing the counter during btbulkdelete startup. Thoughts, better ideas? regards, tom lane
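In code, the per-page decision during the physical-order scan could look roughly like the sketch below. This is only an illustration of the chase logic: the marker field is assumed to be a 16-bit cycle id stored in the page opaque data (called btpo_cycleid here), and item removal and marker clearing are elided.

#include "postgres.h"
#include "access/nbtree.h"
#include "storage/bufmgr.h"

static void
bulkdelete_visit(Relation rel, BlockNumber outer_blkno, uint16 cycleid)
{
	BlockNumber blkno = outer_blkno;

	for (;;)
	{
		Buffer		buf = ReadBuffer(rel, blkno);
		Page		page;
		BTPageOpaque opaque;
		BlockNumber rightlink;
		bool		chase;

		LockBufferForCleanup(buf);	/* the "super-exclusive" vacuum lock */
		page = BufferGetPage(buf);
		opaque = (BTPageOpaque) PageGetSpecialPointer(page);

		/* ... remove deletable items and clear the split marker here ... */

		rightlink = opaque->btpo_next;
		/* newly split during this vacuum, and items may have moved to a
		 * lower-numbered page we have already passed over? */
		chase = (opaque->btpo_cycleid == cycleid &&
				 !P_RIGHTMOST(opaque) &&
				 rightlink < outer_blkno);
		UnlockReleaseBuffer(buf);

		if (!chase)
			break;				/* resume the sequential outer loop */
		blkno = rightlink;		/* follow the right-link and vacuum that page too */
	}
}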
On Fri, 5 May 2006, Tom Lane wrote:
> I have a sketch of a solution that doesn't require any change in page
> allocation behavior. Can anyone see any holes in this:

Looks good to me.

> Assume that we have some way to recognize whether a page has been split
> since the current btbulkdelete scan started. (A split would mark both the
> original page and the new right-hand page as newly split.) When
> btbulkdelete arrives at a page, it need take no special action unless the
> page is newly split *and* its right-link points to a lower physical
> address. If that's true, then after vacuuming the page, follow its
> right-link and vacuum that page; repeat until arriving at a page that is
> either not newly split or is above the current location of the outer loop.
> Then return to the outer, sequential-scan loop.

It'd be a bit more efficient to finish the sequential-scan first, and memorize the newly-split pages' right-links as they're encountered. Then scan those pages as a separate second pass, or earlier if we run out of memory reserved for memorizing them.

> We should also have btbulkdelete clear the newly-split marker whenever
> it processes a page; this ensures that no page is processed more than
> once, which is probably worth the possible extra I/O needed to clear the
> marker. (The cycles to re-scan a page are maybe not that important, but
> if we do reprocess pages we'll end up with a bogusly high tuple count.
> OTOH I'm not sure we can guarantee an accurate tuple count anyway,
> so maybe it doesn't matter.)

Yeah, seems worth it to always clear the marker. Definitely when the page is dirtied anyway.

> AFAICS, this works correctly even if the test for a newly-split page
> sometimes yields false positives; thinking a page is newly split when
> it is not might cost a little extra I/O but won't lead to wrong results.
> This reduces the requirements for the newly-split marking mechanism.
>
> So, how do we do that marking? Noting that we have 16 bits we could use
> in BTPageOpaqueData without making that struct any bigger (because of
> alignment considerations), I'm thinking about a 16-bit counter for each
> index that is incremented at the start of each btbulkdelete operation and
> copied into the opaque data whenever a page is split. If the value's
> different from the current counter, the page definitely hasn't been split
> during btbulkdelete. There's a 1-in-65536 chance of a false positive,
> if the last split occurred some exact multiple of 65536 vacuums ago, but
> that's probably low enough to be acceptable. (We could reduce the odds of
> false positives in various ways, eg by saying that zero isn't a valid
> counter value and having btbulkdelete reset a page's counter to zero
> anytime it has to write the page anyway.)

If btbulkdelete always clears the marker (assuming zero isn't a valid value), 16 bits is plenty. Unless a vacuum is aborted, there should never be a value older than current value - 1 in the index. We could live with a 2-bit counter.

> This still has the same sort of locking issues I complained of in
> regards to Heikki's idea, namely how do you make sure that everyone is
> using the new counter value before you start scanning? That seems
> soluble however. We'd just need to be willing to take one more lock
> during every page split operation, which does not seem an intolerable
> amount of overhead. Then we do something like take a sharelock while
> fetching the counter value during split (and hold the lock until the
> split's complete), and take a momentary exclusive lock while advancing
> the counter during btbulkdelete startup.

That's not too bad. Where exactly were you thinking of putting the counter and the lock?

> Thoughts, better ideas?

Well, we could enhance my proposal a bit to make the fragmentation effect less severe. Instead of a simple flag indicating that a vacuum is in progress, vacuum could announce the current page it's processing in a per-index shmem variable. A page split must read that counter, holding a shared lock, and choose the target page so that that boundary is not crossed, i.e. so that we don't end up with the new page below the boundary and the old page above it. Otherwise, it's free to choose what it wants. To make good target page choices, it needs some help from FSM.

I think I like your proposal more, though. It seems better concurrency-wise.

- Heikki
Heikki Linnakangas <hlinnaka@iki.fi> writes: > That's not too bad. Where exactly were you thinking of putting the > counter and the lock? My original thought was to keep it in btree metapages, but that's kind of annoying since we just went to some effort to cache metapage contents; which means the metapage will likely not be hanging around in buffer cache. Right now I'm thinking we could have a single system-wide counter instead of one per index, and keep it in shared memory. There's no strong reason why it needs to survive system crash/restart AFAICS. Aside from the counter proper, shared memory would need to contain enough info so we could tell whether btbulkdelete was active for any particular index, and the marker value to use for that index if so. But this requires only enough space for 1 entry per active vacuum, which is bounded by MaxBackends and will usually be a lot smaller. A benefit of doing it that way is that when btbulkdelete isn't active, splits can reliably tell so and write zero into the page markers. That reduces the odds of false positives quite a lot, assuming that bulkdelete is only active a small fraction of the time. BTW, I'm currently working on cleaning up the page-at-a-time-scan part of your patch, since we could apply that without changing VACUUM at all. We need to do performance testing on that and make sure it won't kill us for unique indexes, else this whole project may have to be shelved :-(. I've found a few more problems, eg the code doesn't handle changes in indexscan direction properly, but no showstoppers. regards, tom lane
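For illustration, the shared-memory state could be as simple as the sketch below (all names are made up; the real structure and locking could well differ):

#include "postgres.h"

/* one entry per btbulkdelete currently in progress */
typedef struct ActiveIndexVacuum
{
	Oid			indexrelid;		/* index a btbulkdelete is running against */
	uint16		cycleid;		/* marker value page splits should stamp */
} ActiveIndexVacuum;

typedef struct IndexVacuumShmem
{
	uint16		lastCycleId;	/* wraps around; zero is never handed out */
	int			numVacuums;		/* entries currently in use */
	int			maxVacuums;		/* sized from MaxBackends at startup */
	ActiveIndexVacuum vacuums[FLEXIBLE_ARRAY_MEMBER];
} IndexVacuumShmem;

A page split would take the lock in shared mode, look up its index in vacuums[], and stamp the entry's cycleid (or zero if no vacuum is active for that index) into both halves of the split; btbulkdelete startup would take the lock exclusively to advance lastCycleId and register its entry, removing the entry again when the scan finishes.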
I've committed a rewritten version of this patch. Heikki Linnakangas <hlinnaka@iki.fi> writes: > On Fri, 5 May 2006, Tom Lane wrote: >> btbulkdelete arrives at a page, it need take no special action unless the >> page is newly split *and* its right-link points to a lower physical >> address. If that's true, then after vacuuming the page, follow its >> right-link and vacuum that page; repeat until arriving at a page that is >> either not newly split or is above the current location of the outer loop. >> Then return to the outer, sequential-scan loop. > It'd be a bit more efficient to finish the sequential-scan first, and > memorize the newly-split pages' right-links as they're encountered. Then > scan those pages as a separate second pass, or earlier if we run out of > memory reserved for memorizing them. I didn't do this. Aside from the extra memory requirement, it's not apparent to me that it'd make things faster. The disadvantage is that it would require more page reads than the other way: if you visit the split page immediately, and note that its right-link is above the current outer loop location, then you can skip following the right-link because you know you'll visit the page later. If you postpone then you have to chase every chain until actually reading a page with an old cycle ID. I think this extra I/O would likely outweigh any savings from not interrupting the main scan. > If btbulkdelete always clears the marker (assuming zero isn't a valid > value), 16 bits is plenty. Unless a vacuum is aborted, there should never > be a value older than current value - 1 in the index. We could live with a > 2-bit counter. For the moment, the code is only clearing the marker if it's equal to the current cycle ID. This is sufficient to recognize definitely-already-processed pages, but it doesn't prevent false positives in general. If we ever need the space we could narrow the counter, at the cost of having to expend more I/O to keep the values cleared. regards, tom lane
On Fri, 2006-05-05 at 18:04 -0400, Tom Lane wrote: > Heikki Linnakangas <hlinnaka@iki.fi> writes: > > On Fri, 5 May 2006, Tom Lane wrote: > >>> BTW, I just realized another bug in the patch: btbulkdelete fails to > >>> guarantee that it visits every page in the index. > > > The first solution that occurs to me is to force page splits to choose the > > target page so that it's blkno > the original page's blkno during vacuum. > > It would cause the index to become more fragmented more quickly, which is > > bad but perhaps tolerable. > > I have a sketch of a solution that doesn't require any change in page > allocation behavior. Can anyone see any holes in this: > > Assume that we have some way to recognize whether a page has been split > since the current btbulkdelete scan started. (A split would mark both the > original page and the new right-hand page as newly split.) When > btbulkdelete arrives at a page, it need take no special action unless the > page is newly split *and* its right-link points to a lower physical > address. If that's true, then after vacuuming the page, follow its > right-link and vacuum that page; repeat until arriving at a page that is > either not newly split or is above the current location of the outer loop. > Then return to the outer, sequential-scan loop. > > We should also have btbulkdelete clear the newly-split marker whenever > it processes a page; this ensures that no page is processed more than > once, which is probably worth the possible extra I/O needed to clear the > marker. (The cycles to re-scan a page are maybe not that important, but > if we do reprocess pages we'll end up with a bogusly high tuple count. > OTOH I'm not sure we can guarantee an accurate tuple count anyway, > so maybe it doesn't matter.) > > AFAICS, this works correctly even if the test for a newly-split page > sometimes yields false positives; thinking a page is newly split when > it is not might cost a little extra I/O but won't lead to wrong results. > This reduces the requirements for the newly-split marking mechanism. > > So, how do we do that marking? Noting that we have 16 bits we could use > in BTPageOpaqueData without making that struct any bigger (because of > alignment considerations), I'm thinking about a 16-bit counter for each > index that is incremented at the start of each btbulkdelete operation and > copied into the opaque data whenever a page is split. If the value's > different from the current counter, the page definitely hasn't been split > during btbulkdelete. There's a 1-in-65536 chance of a false positive, > if the last split occurred some exact multiple of 65536 vacuums ago, but > that's probably low enough to be acceptable. (We could reduce the odds of > false positives in various ways, eg by saying that zero isn't a valid > counter value and having btbulkdelete reset a page's counter to zero > anytime it has to write the page anyway.) I read your earlier post about needing to lock everything and spent some time thinking about this. The issue of needing to lock everything means that we would never be able to do a partial vacuum of an index i.e. remove one page without a scan. I'm more concerned about making partial vacuum work than I am about speeding up an all-block vacuum. I'd been mulling over the locking problems and came up with something that looks identical to your proposal *except* for the value that gets written into the BTPageOpaqueData on the right-hand page. 
My thinking was to write the blockid of the original left hand page, so as to record the original ancestor since split. Thus if multiple splits occurred, then the original ancestor blockid would remain on record until VACUUM. In more detail: when we split a page, if the ancestor blockid is not set, then we set it to be the blockid of the old page (the new left hand page). If the ancestor blockid is already set then we copy that to the new right hand page. Every split will write a value to BTPageOpaqueData, though the values to use are already available without extra work.

AFAICS this doesn't change the ability to scan the index in physical order, so we still get the benefits of that action. A *partial* vacuum of an index block can be achieved by visiting the block to be vacuumed, then following the link directly to the pre-split ancestor block (if there is one), then moving right again back to the target block. btbulkdelete always clears the marker when it cleans. This opens the door for the approach of notifying when a page is deletable and then having a background agent remove just that page, as proposed in a recent patch by Junji TERAMOTO. I'm *not* presuming we have an agreed solution for partial vacuuming, but I do want to keep that door open by implementing a locking protocol that could support it.

> This still has the same sort of locking issues I complained of in
> regards to Heikki's idea, namely how do you make sure that everyone is
> using the new counter value before you start scanning? That seems
> soluble however. We'd just need to be willing to take one more lock
> during every page split operation, which does not seem an intolerable
> amount of overhead. Then we do something like take a sharelock while
> fetching the counter value during split (and hold the lock until the
> split's complete), and take a momentary exclusive lock while advancing
> the counter during btbulkdelete startup.

I'm not very happy about an extra lock during page splitting, which adds a performance hit even for tables that never will need regular vacuuming (apart from occasional wrap-around avoidance). AFAICS if we just store the ancestor blockid we don't need to do the business with shared memory and extra locks etc.

The size of the field we hold for BTPageOpaqueData seems fairly unimportant, since we leave lots of space in index pages anyway - a few extra bytes would only make a noise-level change in performance.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
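For comparison, the split-time marking in this scheme would amount to a couple of lines (sketch only; btpo_ancestor is a hypothetical field that, as discussed downthread, would not fit in the existing 16 spare bits):

#include "postgres.h"
#include "storage/block.h"

/* hypothetical addition to the btree page opaque data */
typedef struct SplitAncestorData
{
	BlockNumber btpo_ancestor;	/* pre-split ancestor, or InvalidBlockNumber */
} SplitAncestorData;

/*
 * Called while splitting origblkno: the new right-hand page inherits the
 * original pre-split ancestor, so all pages split off from one page since
 * its last vacuum form one traceable group.  Assumes btpo_ancestor is
 * reset to InvalidBlockNumber whenever a page is vacuumed or recycled.
 */
static void
mark_split_ancestor(SplitAncestorData *lmark, SplitAncestorData *rmark,
					BlockNumber origblkno)
{
	/* first split of this page since its last vacuum? */
	if (!BlockNumberIsValid(lmark->btpo_ancestor))
		lmark->btpo_ancestor = origblkno;

	/* the new right-hand page inherits the original ancestor */
	rmark->btpo_ancestor = lmark->btpo_ancestor;
}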
Simon Riggs <simon@2ndquadrant.com> writes: > I read your earlier post about needing to lock everything and spent some > time thinking about this. The issue of needing to lock everything means > that we would never be able to do a partial vacuum of an index i.e. > remove one page without a scan. I'm more concerned about making partial > vacuum work than I am about speeding up an all-block vacuum. [ shrug... ] That's an illusory goal anyway. Retail tuple removal is just too inefficient. (No, I don't believe in that proposed patch.) > My thinking was to write the blockid of the original left hand page, so > as to record the original ancestor since split. Thus if multiple splits > occurred, then the original ancestor blockid would remain on record > until VACUUM. In more detail: When we split a page if the ancestor > blockid is not set, then we set it to be the blockid of the old page > (new left hand page). If the ancestor blockid is already set then we > copy that to the new right hand page. Every split will write a value to > BTPageOpaqueData, though the values to use are already available without > extra work. Doesn't work, at least not for making it possible to vacuum part of the index. The conflicting indexscan could have stopped on a page, and then that page could have split, before your "partial vacuum" ever started. So tracing back only as far as the data has split since vacuum started is not enough to prevent conflict. (The other little problem is that we'd have to enlarge the BTOpaque overhead, because a block id doesn't fit in the available 16 bits.) > I'm not very happy about an extra lock during page splitting, which adds > a performance hit even for tables that never will need regular vacuuming > (apart from occaisional wrap-around avoidance). While I'd rather not have done that, I don't believe that it makes for any material performance degradation. Normal splits all take the lock in shared mode and hence suffer no contention. Your proposal wouldn't make for less locking anyway, since it still assumes that there's a way to tell whether vacuum is active for a given index, which is just about the same amount of overhead as the code-as-committed. regards, tom lane
On Mon, 2006-05-08 at 10:18 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > I read your earlier post about needing to lock everything and spent some > > time thinking about this. The issue of needing to lock everything means > > that we would never be able to do a partial vacuum of an index i.e. > > remove one page without a scan. I'm more concerned about making partial > > vacuum work than I am about speeding up an all-block vacuum. > > [ shrug... ] That's an illusory goal anyway. Retail tuple removal is > just too inefficient. (No, I don't believe in that proposed patch.) Current VACUUM optimizes for the case where random updates/deletes occur. If there are hotspots then scanning the whole relation is O(N) or even O(N^2) if we have to scan the indexes multiple times. We think we have a way to improve heap VACUUMs (bitmaps...) but we are still looking for an equivalent for indexes. > > My thinking was to write the blockid of the original left hand page, so > > as to record the original ancestor since split. Thus if multiple splits > > occurred, then the original ancestor blockid would remain on record > > until VACUUM. In more detail: When we split a page if the ancestor > > blockid is not set, then we set it to be the blockid of the old page > > (new left hand page). If the ancestor blockid is already set then we > > copy that to the new right hand page. Every split will write a value to > > BTPageOpaqueData, though the values to use are already available without > > extra work. > > Doesn't work, at least not for making it possible to vacuum part of the > index. The conflicting indexscan could have stopped on a page, and then > that page could have split, before your "partial vacuum" ever started. > So tracing back only as far as the data has split since vacuum started > is not enough to prevent conflict. That wasn't the proposal. Every split would be marked and stay marked until those blocks were VACUUMed. The data used to mark is readily available and doesn't rely on whether or not VACUUM is running. IMHO this does work. > (The other little problem is that we'd have to enlarge the BTOpaque > overhead, because a block id doesn't fit in the available 16 bits.) ISTM it is indeed a little problem. After CREATE INDEX, most index pages don't stay completely full, so worrying about alignment wastage might very occasionally save an I/O, but not enough for us to care. > > I'm not very happy about an extra lock during page splitting, which adds > > a performance hit even for tables that never will need regular vacuuming > > (apart from occaisional wrap-around avoidance). > > While I'd rather not have done that, I don't believe that it makes for > any material performance degradation. Normal splits all take the lock > in shared mode and hence suffer no contention. Your proposal wouldn't > make for less locking anyway, since it still assumes that there's a way > to tell whether vacuum is active for a given index, which is just about > the same amount of overhead as the code-as-committed. The proposed scheme doesn't rely on knowing whether a VACUUM is active or not. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Simon Riggs <simon@2ndquadrant.com> writes: > That wasn't the proposal. Every split would be marked and stay marked > until those blocks were VACUUMed. The data used to mark is readily > available and doesn't rely on whether or not VACUUM is running. > IMHO this does work. OK, I misunderstood what you had in mind, but now that I do understand it doesn't seem terribly efficient. What you're suggesting is that we take as a "vacuum group" all the pages that have been split off from a single original page since that page was last vacuumed, and that this group must be vacuumed as a whole. That works, but it seems that the groups would get awfully large. In particular, this substantially penalizes btbulkdelete in hopes of someday improving matters for what remains an entirely fictional partial vacuum. As it stands today, btbulkdelete only has to worry about page groups formed since it began to run, not since the last vacuum. Changing the data representation like this would force it to retrace much more often and over much larger page groups. >> (The other little problem is that we'd have to enlarge the BTOpaque >> overhead, because a block id doesn't fit in the available 16 bits.) > ISTM it is indeed a little problem. After CREATE INDEX, most index pages > don't stay completely full, so worrying about alignment wastage might > very occasionally save an I/O, but not enough for us to care. I don't think an extra 4 or 8 bytes need be a show-stopper, but it does force an initdb and thus make it harder to compare the two techniques. regards, tom lane
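To spell out that difference, here is a paraphrase (not the committed vacuum-scan code; the helper and parameter names are invented) of the per-page test the physical-order scan has to make. In the code as committed, a leaf page forces the scan to go back and revisit the pages split off from it only when its cycle id (the 16-bit field mentioned upthread) matches the id assigned when this particular vacuum started; pages split earlier carry a stale or zero id and need no special handling. Under the ancestor-blockid scheme the mark would persist until the group is next vacuumed, so every page split since the previous vacuum would trigger the extra work.

    #include "postgres.h"
    #include "access/nbtree.h"   /* BTPageOpaque, BTCycleId, btpo_cycleid */

    /*
     * Sketch: must the bulk-delete scan go back and also process pages that
     * were split off from this leaf page?  The real test in the committed
     * code is more involved (it also looks at page flags and the right
     * sibling's position); those extra conditions are omitted here.
     */
    static bool
    page_needs_backtrack(BTPageOpaque opaque, BTCycleId this_vacuum_cycleid)
    {
        /*
         * Only pages split while *this* vacuum was already running carry a
         * matching cycle id.  Under the ancestor-blockid proposal the test
         * would instead be roughly "btpo_ancestor != InvalidBlockNumber"
         * (hypothetical field), which stays true for every split since the
         * previous VACUUM, hence the much larger page groups.
         */
        return opaque->btpo_cycleid == this_vacuum_cycleid;
    }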
On Mon, 2006-05-08 at 11:26 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > That wasn't the proposal. Every split would be marked and stay marked > > until those blocks were VACUUMed. The data used to mark is readily > > available and doesn't rely on whether or not VACUUM is running. > > IMHO this does work. > > OK, I misunderstood what you had in mind, but now that I do understand > it doesn't seem terribly efficient. What you're suggesting is that we > take as a "vacuum group" all the pages that have been split off from a > single original page since that page was last vacuumed, and that this > group must be vacuumed as a whole. That works, but it seems that the > groups would get awfully large. In particular, this substantially > penalizes btbulkdelete in hopes of someday improving matters for what > remains an entirely fictional partial vacuum. OK, so we have the germ of a new mechanism - and I very much agree that the idea of a partial vacuum is at present entirely fictional...but we at least have a starting place. > As it stands today, > btbulkdelete only has to worry about page groups formed since it began > to run, not since the last vacuum. Changing the data representation > like this would force it to retrace much more often and over much larger > page groups. Yes, I saw the potential issue you mention - but for many cases the index grows forwards and so we wouldn't care in either case. Page splits that go to lower blockids are limited by available space, so would be less of a problem. I'm balancing the additional cost of page splits against the additional cost on the vacuum. I would prefer to keep in-line ops faster and pay a little extra on the out-of-line operations, if that's what it takes. I note your point that there is little contention, but there is still a cost and in many cases this cost is being paid on tables that never will be VACUUMed. For insert-intensive apps, this adds cost, for little benefit. For update-intensive apps, we're VACUUMing continually anyway so there's no benefit from doing this only-during-VACUUM. So we just optimised for slowly-but-continually churning tables (i.e. DELETEs match INSERTs, or just UPDATEs). i.e. we just improved VACUUM performance for those that don't need it that often. That might be the common case, but it isn't the one that's hurting most. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Simon Riggs <simon@2ndquadrant.com> writes: > So we just optimised for slowly-but-continually churning tables (i.e. > DELETEs match INSERTs, or just UPDATEs). i.e. we just improved VACUUM > performance for those that don't need it that often. That might be the > common case, but it isn't the one that's hurting most. No, we just improved VACUUM performance for those that need it most. I was just doing some desultory experiments with today's code versus yesterday's (it's easy to make a direct comparison because they're initdb-compatible ... just stop one postmaster executable and start another on the same database). I made a table of 16M rows with an index over a random-data integer column. With a thoroughly disordered index (built on-the-fly as the random data was inserted), the time to VACUUM after deleting a small number of rows was 615 seconds with yesterday's code, 31 seconds today. With a perfectly-ordered index (identical table, but CREATE INDEX after all the data is in place), the times were about 28 and 26 seconds respectively. (I wouldn't put a *whole* lot of faith in these numbers, since they're from a Dell machine with a toy ATA drive, but anyway they do show that sequential access to the index makes a huge difference.) But perfectly-ordered indexes are impossible to maintain unless your data is either static or insert-at-the-end-only. Anyway, while the extra LWLock and shared-memory access during a split is annoying, I think it's really negligible (and so does pgbench). A page split is going to involve many times that much work --- you've got to acquire lock on at least four different buffers, visit the FSM, possibly ask the kernel for another disk page, do XLogInsert twice, etc. Any one of those things involves significantly more CPU effort and contention than _bt_vacuum_cycleid(). So while I'd be happy to get rid of it, not at the price of slowing down VACUUM again. regards, tom lane
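For reference, the split-side overhead being weighed here amounts to roughly the following paraphrase of what the committed split code does, as I read it; the helper and variable names are invented, while _bt_vacuum_cycleid(), BTPageOpaque and btpo_cycleid are real names from the committed patch.

    #include "postgres.h"
    #include "access/nbtree.h"     /* _bt_vacuum_cycleid, BTPageOpaque */
    #include "storage/bufpage.h"   /* Page, PageGetSpecialPointer */
    #include "utils/rel.h"         /* Relation */

    /*
     * Illustrative helper: stamp both halves of a just-split page with the
     * cycle id of any vacuum currently running on this index.
     * _bt_vacuum_cycleid() takes one shared-mode LWLock to consult a small
     * shared-memory array and returns 0 when no vacuum is active; that lock
     * plus these two stores is the per-split cost under discussion.
     */
    static void
    stamp_split_pages(Relation rel, Page leftpage, Page rightpage)
    {
        BTCycleId   cycleid = _bt_vacuum_cycleid(rel);

        ((BTPageOpaque) PageGetSpecialPointer(leftpage))->btpo_cycleid = cycleid;
        ((BTPageOpaque) PageGetSpecialPointer(rightpage))->btpo_cycleid = cycleid;
    }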
Tom, On 5/8/06 11:46 AM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote: > I made a table of 16M rows with an > index over a random-data integer column. With a thoroughly disordered > index (built on-the-fly as the random data was inserted), the time to > VACUUM after deleting a small number of rows was 615 seconds with > yesterday's code, 31 seconds today. With a perfectly-ordered index > (identical table, but CREATE INDEX after all the data is in place), the > times were about 28 and 26 seconds respectively. Very impressive! This corroborates findings we've had with index maintenance in the field - thanks for finding/fixing this. - Luke
On Mon, 2006-05-08 at 14:46 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > So we just optimised for slowly-but-continually churning tables (i.e. > > DELETEs match INSERTs, or just UPDATEs). i.e. we just improved VACUUM > > performance for those that don't need it that often. That might be the > > common case, but it isn't the one that's hurting most. > > No, we just improved VACUUM performance for those that need it most. > I was just doing some desultory experiments with today's code versus > yesterday's (it's easy to make a direct comparison because they're > initdb-compatible ... just stop one postmaster executable and start > another on the same database). I made a table of 16M rows with an > index over a random-data integer column. With a thoroughly disordered > index (built on-the-fly as the random data was inserted), the time to > VACUUM after deleting a small number of rows was 615 seconds with > yesterday's code, 31 seconds today. With a perfectly-ordered index > (identical table, but CREATE INDEX after all the data is in place), the > times were about 28 and 26 seconds respectively. (I wouldn't put a > *whole* lot of faith in these numbers, since they're from a Dell machine > with a toy ATA drive, but anyway they do show that sequential access to > the index makes a huge difference.) But perfectly-ordered indexes are > impossible to maintain unless your data is either static or insert-at-the-end-only. You and Heikki have achieved a marvelous thing, well done. > Anyway, while the extra LWLock and shared-memory access during a split > is annoying, I think it's really negligible (and so does pgbench). > A page split is going to involve many times that much work --- you've > got to acquire lock on at least four different buffers, visit the FSM, > possibly ask the kernel for another disk page, do XLogInsert twice, etc. > Any one of those things involves significantly more CPU effort and > contention than _bt_vacuum_cycleid(). So while I'd be happy to get rid > of it, not at the price of slowing down VACUUM again. I'll raise the partial vacuum topic on -hackers. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com