Thread: On-the-fly index tuple deletion vs. hot_standby
I have a hot_standby system and use it to bear the load of various reporting queries that take 15-60 minutes each. In an effort to avoid long pauses in recovery, I set a vacuum_defer_cleanup_age corresponding to roughly three hours of the master's transactions. Even so, I kept seeing recovery pause for the duration of a long-running query. In each case, the culprit record was an XLOG_BTREE_DELETE arising from on-the-fly deletion of an index tuple. The attached test script demonstrates the behavior (on HEAD); the index tuple reclamation conflicts with a concurrent "SELECT pg_sleep(600)" on the standby.

Since the inserting transaction aborts, HeapTupleSatisfiesVacuum reports HEAPTUPLE_DEAD independent of vacuum_defer_cleanup_age, and we go ahead and remove the index tuples. On the standby, btree_xlog_delete_get_latestRemovedXid does not consider the inserting transaction's outcome, so btree_redo proceeds to conflict with snapshots that still have visibility over that transaction. Could we correctly improve this by teaching btree_xlog_delete_get_latestRemovedXid to ignore tuples of aborted transactions and tuples inserted and deleted within one transaction?

Thanks,
nm
Attachment
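A minimal sketch of the check being proposed here, as it might sit inside btree_xlog_delete_get_latestRemovedXid()'s loop over the to-be-deleted index tuples. This is a sketch only, not the eventual patch; the loop-local names (htuphdr, latestRemovedXid) and the placement are assumptions:

    /*
     * Sketch only -- not the committed fix.  Assume htuphdr is the heap
     * tuple header that the current index tuple points to, and
     * latestRemovedXid is the running maximum being computed for the
     * XLOG_BTREE_DELETE record.
     */
    TransactionId xmin = HeapTupleHeaderGetXmin(htuphdr);
    TransactionId xmax = HeapTupleHeaderGetXmax(htuphdr);

    /*
     * A tuple whose inserter aborted, or that was inserted and deleted
     * by one transaction, was never visible to any other snapshot, so
     * its removal cannot conflict with any standby query: skip it.
     */
    if (!TransactionIdDidCommit(xmin) || TransactionIdEquals(xmin, xmax))
        continue;

    if (TransactionIdFollows(xmax, latestRemovedXid))
        latestRemovedXid = xmax;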
On 29.11.2010 08:10, Noah Misch wrote:
> I have a hot_standby system and use it to bear the load of various reporting queries that take 15-60 minutes each. In an effort to avoid long pauses in recovery, I set a vacuum_defer_cleanup_age corresponding to roughly three hours of the master's transactions. Even so, I kept seeing recovery pause for the duration of a long-running query. In each case, the culprit record was an XLOG_BTREE_DELETE arising from on-the-fly deletion of an index tuple. The attached test script demonstrates the behavior (on HEAD); the index tuple reclamation conflicts with a concurrent "SELECT pg_sleep(600)" on the standby.
>
> Since the inserting transaction aborts, HeapTupleSatisfiesVacuum reports HEAPTUPLE_DEAD independent of vacuum_defer_cleanup_age, and we go ahead and remove the index tuples. On the standby, btree_xlog_delete_get_latestRemovedXid does not consider the inserting transaction's outcome, so btree_redo proceeds to conflict with snapshots that still have visibility over that transaction. Could we correctly improve this by teaching btree_xlog_delete_get_latestRemovedXid to ignore tuples of aborted transactions and tuples inserted and deleted within one transaction?

Seems reasonable. HeapTupleHeaderAdvanceLatestRemovedXid() will need similar treatment. Actually, btree_xlog_delete_get_latestRemovedXid() could just call HeapTupleHeaderAdvanceLatestRemovedXid().

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Fri, 2010-12-03 at 21:43 +0200, Heikki Linnakangas wrote:
> On 29.11.2010 08:10, Noah Misch wrote:
> > [...] Could we correctly improve this by teaching btree_xlog_delete_get_latestRemovedXid to ignore tuples of aborted transactions and tuples inserted and deleted within one transaction?

@Noah Easily the best bug report submitted in a long time. Thanks.

> Seems reasonable. HeapTupleHeaderAdvanceLatestRemovedXid() will need similar treatment. Actually, btree_xlog_delete_get_latestRemovedXid() could just call HeapTupleHeaderAdvanceLatestRemovedXid().

Yes, it applies to other cases also. Thanks for the suggestion.

Fix committed. Please double-check my work, committed early since I'm about to jump on a plane.

--
Simon Riggs           http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
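For readers following along, the committed HeapTupleHeaderAdvanceLatestRemovedXid() ends up with approximately the following shape. This is a from-memory paraphrase that omits the HEAP_MOVED/xvac handling; the commit itself is authoritative:

    void
    HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
                                           TransactionId *latestRemovedXid)
    {
        TransactionId xmin = HeapTupleHeaderGetXmin(tuple);
        TransactionId xmax = HeapTupleHeaderGetXmax(tuple);

        /*
         * Advance latestRemovedXid only when the inserter is known to
         * have committed (hint bit already set, or pg_clog says so) and
         * the tuple was not inserted and deleted by the same
         * transaction.  Tuples of aborted inserters, and same-transaction
         * insert+delete, were never visible to any other snapshot, so
         * removing them can conflict with no standby query.
         */
        if ((tuple->t_infomask & HEAP_XMIN_COMMITTED) ||
            (!(tuple->t_infomask & HEAP_XMIN_INVALID) &&
             TransactionIdDidCommit(xmin)))
        {
            if (xmax != xmin &&
                TransactionIdFollows(xmax, *latestRemovedXid))
                *latestRemovedXid = xmax;
        }

        /* *latestRemovedXid may legitimately remain invalid here. */
    }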
On Thu, Dec 09, 2010 at 09:48:25AM +0000, Simon Riggs wrote:
> On Fri, 2010-12-03 at 21:43 +0200, Heikki Linnakangas wrote:
> > Seems reasonable. HeapTupleHeaderAdvanceLatestRemovedXid() will need similar treatment. Actually, btree_xlog_delete_get_latestRemovedXid() could just call HeapTupleHeaderAdvanceLatestRemovedXid().
>
> Yes, it applies to other cases also. Thanks for the suggestion.
>
> Fix committed. Please double-check my work, committed early since I'm about to jump on a plane.

Thanks for making that change. For my understanding, why does the xmin == xmax special case in HeapTupleHeaderAdvanceLatestRemovedXid not require !HEAP_UPDATED, as the corresponding case in HeapTupleSatisfiesVacuum does? I can neither think of a recipe for triggering a problem as the code stands, nor come up with a sound explanation for why no such recipe can exist.

nm
On 10.12.2010 19:55, Noah Misch wrote:
> On Thu, Dec 09, 2010 at 09:48:25AM +0000, Simon Riggs wrote:
>> Yes, it applies to other cases also. Thanks for the suggestion.
>>
>> Fix committed. Please double-check my work, committed early since I'm about to jump on a plane.
>
> Thanks for making that change. For my understanding, why does the xmin == xmax special case in HeapTupleHeaderAdvanceLatestRemovedXid not require !HEAP_UPDATED, as the corresponding case in HeapTupleSatisfiesVacuum does? I can neither think of a recipe for triggering a problem as the code stands, nor come up with a sound explanation for why no such recipe can exist.

The difference is in the purpose of those two functions. HeapTupleSatisfiesVacuum decides if a tuple can be safely vacuumed away. For that purpose, you can't remove a tuple from the middle of an update chain, even if that tuple was never visible to any other transaction, because someone might still want to follow the update chain to find the latest version of the row.

HeapTupleHeaderAdvanceLatestRemovedXid is used to decide if removing a tuple would conflict with in-progress hot standby queries. For that purpose, you don't need to care about breaking update chains, as Hot Standby is only used for read-only queries and read-only queries never follow update chains.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
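To make the distinction concrete, here is the narrower guard that Noah's question contemplates, written as a purely hypothetical variant of the xmin == xmax test (reusing the loop-local names assumed in the earlier sketch). Since standby queries never follow update chains, no such change is needed:

    /*
     * Hypothetical, for illustration only.  If read-only standby
     * queries could follow update chains, a tuple with xmin == xmax
     * could be ignored only when it is not a chain member, i.e. only
     * when HEAP_UPDATED is unset:
     */
    if (TransactionIdEquals(xmin, xmax) &&
        !(tuple->t_infomask & HEAP_UPDATED))
        continue;        /* never visible to anyone, and not mid-chain */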
On Thu, Dec 09, 2010 at 09:48:25AM +0000, Simon Riggs wrote:
> [...]
> Fix committed. Please double-check my work, committed early since I'm about to jump on a plane.

The installation that inspired my original report recently upgraded from 9.0.1 to 9.0.3, and your fix did significantly decrease its conflict frequency. The last several conflicts I have captured involve XLOG_BTREE_REUSE_PAGE records. (FWIW, the index has generally been pg_attribute_relid_attnam_index.) I've attached a test script demonstrating the behavior. _bt_page_recyclable approves any page deleted no more recently than RecentXmin, because we need only ensure that every ongoing scan has witnessed the page as dead. For the hot standby case, we need to account for possibly-ongoing standby transactions. Using RecentGlobalXmin covers that, albeit with some pessimism: we really only need LEAST(RecentXmin, PGPROC->xmin of walsender_1, .., PGPROC->xmin of walsender_N) - vacuum_defer_cleanup_age. Not sure the accounting to achieve that would pay off, though. Thoughts?

Thanks,
nm
Attachment
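For context, the test _bt_page_recyclable applies looks roughly like the following. This is a from-memory paraphrase of the 9.0-era nbtpage.c, not an exact copy; treat the exact comparison operator as approximate:

    bool
    _bt_page_recyclable(Page page)
    {
        BTPageOpaque opaque;

        /*
         * An all-zeroes page (e.g. left behind by a crashed relation
         * extension) can be reused immediately.
         */
        if (PageIsNew(page))
            return true;

        /*
         * Otherwise, recycle the page only if it was deleted long enough
         * ago that no scan running at deletion time can still hold a
         * link to it.  RecentXmin is the master-only horizon under
         * discussion; the proposal above is to use RecentGlobalXmin here
         * when standby transactions must be accounted for.
         */
        opaque = (BTPageOpaque) PageGetSpecialPointer(page);
        if (P_ISDELETED(opaque) &&
            TransactionIdPrecedesOrEquals(opaque->btpo.xact, RecentXmin))
            return true;

        return false;
    }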
On 12.03.2011 12:40, Noah Misch wrote:
> The installation that inspired my original report recently upgraded from 9.0.1 to 9.0.3, and your fix did significantly decrease its conflict frequency. The last several conflicts I have captured involve XLOG_BTREE_REUSE_PAGE records. (FWIW, the index has generally been pg_attribute_relid_attnam_index.) I've attached a test script demonstrating the behavior. _bt_page_recyclable approves any page deleted no more recently than RecentXmin, because we need only ensure that every ongoing scan has witnessed the page as dead. For the hot standby case, we need to account for possibly-ongoing standby transactions. Using RecentGlobalXmin covers that, albeit with some pessimism: we really only need LEAST(RecentXmin, PGPROC->xmin of walsender_1, .., PGPROC->xmin of walsender_N) - vacuum_defer_cleanup_age. Not sure the accounting to achieve that would pay off, though. Thoughts?

Hmm, instead of bloating the master, I wonder if we could detect more accurately, in the standby, whether there are any ongoing scans. For example, you must hold a lock on the index to scan it, so only transactions holding the lock need to be checked for conflict.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Mon, Mar 14, 2011 at 01:56:22PM +0200, Heikki Linnakangas wrote:
> On 12.03.2011 12:40, Noah Misch wrote:
>> [...] Using RecentGlobalXmin covers that, albeit with some pessimism: we really only need LEAST(RecentXmin, PGPROC->xmin of walsender_1, .., PGPROC->xmin of walsender_N) - vacuum_defer_cleanup_age. Not sure the accounting to achieve that would pay off, though. Thoughts?
>
> Hmm, instead of bloating the master, I wonder if we could detect more accurately, in the standby, whether there are any ongoing scans. For example, you must hold a lock on the index to scan it, so only transactions holding the lock need to be checked for conflict.

That would be nice. Do you have an outline of an implementation in mind?
On Tue, Mar 15, 2011 at 10:22:59PM -0400, Noah Misch wrote:
> On Mon, Mar 14, 2011 at 01:56:22PM +0200, Heikki Linnakangas wrote:
>> Hmm, instead of bloating the master, I wonder if we could detect more accurately, in the standby, whether there are any ongoing scans. For example, you must hold a lock on the index to scan it, so only transactions holding the lock need to be checked for conflict.
>
> That would be nice. Do you have an outline of an implementation in mind?

In an attempt to resuscitate this thread, here's my own shot at that. Apologies in advance if it's just an already-burning straw man.

I didn't see any way to take advantage of checking for the heavyweight lock that any index scan would need to hold. Such a check becomes obsolete the moment it's done -- nothing stops another locker from arriving between the completion of your check and whatever you do based on its outcome. When a standby is in the picture, the answer needs to be decided at xlog generation time but remain valid at xlog apply time. Therefore, I instead looked at what could be done with buffer locks.

Regarding the underlying requirement, here is the relevant paragraph from src/backend/access/nbtree/README:

    A deleted page can only be reclaimed once there is no scan or search that
    has a reference to it; until then, it must stay in place with its
    right-link undisturbed. We implement this by waiting until all
    transactions that were running at the time of deletion are dead; which is
    overly strong, but is simple to implement within Postgres. When marked
    dead, a deleted page is labeled with the next-transaction counter value.
    VACUUM can reclaim the page for re-use when this transaction number is
    older than the oldest open transaction.

These threads also have useful background as concerns hot standby:

http://archives.postgresql.org/message-id/23761.1265596434@sss.pgh.pa.us
http://archives.postgresql.org/message-id/4B7D3135.3090204@enterprisedb.com
http://archives.postgresql.org/message-id/1268123581.10620.76.camel@ebony

This is implemented in two passes. After updating the adjacent page links, the first pass stores ReadNewTransactionId() in the page. The second pass finishes the reuse if the stored xid is older than RecentXmin. This has the benefits of simplicity and low contention. However, any long-running transaction delays reuse. Also, since RecentXmin's witness of "all transactions that were running at the time of deletion" only includes master transactions, our tools for preventing recovery conflicts (vacuum_defer_cleanup_age, feedback) do not defend against conflicts arising from b-tree page reuse. Fixing that means either choosing different page reuse criteria or letting standby transactions also delay reuse on the master.

If I understand correctly, "has a reference to it" implies a pin on the left, right or parent page. By the time _bt_pagedel is ready to mark the page BTP_DELETED, we already hold exclusive content locks on all of those buffers. Suppose we then check the pin count on each. If they have only single local pins, we have serendipitously acquired cleanup locks: set btpo.xact to FrozenTransactionId, allowing reclamation at any time. Otherwise, continue as the code stands today. (A sketch of this check follows below.) Also change _bt_page_recyclable to use RecentGlobalXmin instead of RecentXmin; this avoids the recovery conflicts. Since, we hope, the vast majority of deleted pages will get FrozenTransactionId, the marginal bloat added by using RecentGlobalXmin will not be significant. This also saves a LW_SHARED acquisition of XidGenLock in the uncontended case -- probably not significant, but it can't hurt.

When the standby replays an XLOG_BTREE_DELETE_PAGE record with btpo_xact == FrozenTransactionId, it must wait for a cleanup lock on all the buffers it updates. We could actually do this anytime between the XLOG_BTREE_DELETE_PAGE and the later XLOG_BTREE_REUSE_PAGE, but we only have the necessary information handy when applying the first record.

Something this intrusive is clearly 9.2 material. It would be nice to have a backpatchable strategy to prop up vacuum_defer_cleanup_age and hot_standby_feedback in 9.0 and 9.1. For that, I haven't come up with anything better than my original pair of proposals.

Comments?

Thanks,
nm
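A minimal sketch of that pin-count check as it might sit in _bt_pagedel(). BufferHasSoleLocalPin() is a hypothetical primitive (no such function exists) standing in for "our pin is the only pin on the buffer, so together with the exclusive content lock we effectively hold a cleanup lock"; lbuf, rbuf and pbuf are assumed to be the already-locked left, right and parent buffers:

    /*
     * Sketch of the proposal, not committed code.  Assumes we already
     * hold exclusive content locks on the target's left sibling (lbuf),
     * right sibling (rbuf) and parent (pbuf), and are about to mark the
     * target BTP_DELETED.  BufferHasSoleLocalPin() is hypothetical.
     */
    if (BufferHasSoleLocalPin(lbuf) &&
        BufferHasSoleLocalPin(rbuf) &&
        BufferHasSoleLocalPin(pbuf))
    {
        /*
         * No concurrent scan can hold a reference to the dead page, so
         * it is reclaimable immediately, on master and standby alike.
         */
        opaque->btpo.xact = FrozenTransactionId;
    }
    else
    {
        /* Fall back to today's rule: wait out currently-running xacts. */
        opaque->btpo.xact = ReadNewTransactionId();
    }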
On Fri, Apr 22, 2011 at 11:10:34AM -0400, Noah Misch wrote:
> In an attempt to resuscitate this thread, here's my own shot at that. Apologies in advance if it's just an already-burning straw man.

[full proposal at http://archives.postgresql.org/message-id/20110422151034.GA8150@tornado.gateway.2wire.net]

Anyone care to comment? On this system, which has vacuum_defer_cleanup_age set to three peak hours' worth of xid consumption, the problem caps recovery-conflict holdoff at 10-20 minutes. It will have the same effect on standby feedback in 9.1. I think we should start by using RecentGlobalXmin instead of RecentXmin as the reuse horizon when wal_level = hot_standby, and backpatch that. Then, independently consider for master a bloat-avoidance improvement like the one I outlined most recently; I'm not sure whether that's worth it. In any event, I'm hoping to get some consensus on the way forward.

Thanks,
nm
On Fri, Apr 22, 2011 at 11:10 AM, Noah Misch <noah@leadboat.com> wrote:
> In an attempt to resuscitate this thread, here's my own shot at that. Apologies in advance if it's just an already-burning straw man.
>
> I didn't see any way to take advantage of checking for the heavyweight lock that any index scan would need to hold.

Have you looked at the logic in ResolveRecoveryConflictWithLock(), and at GetLockConflicts()?

I am a little fuzzy on how the btree stuff works, but it seems to me that you are looking for transactions that both have an xmin before some threshold and also hold an AccessShareLock on some relation. GetLockConflicts() will provide the latter, at least.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi Robert,

On Sat, Jun 11, 2011 at 08:55:28PM -0400, Robert Haas wrote:
> Have you looked at the logic in ResolveRecoveryConflictWithLock(), and at GetLockConflicts()?
>
> I am a little fuzzy on how the btree stuff works, but it seems to me that you are looking for transactions that both have an xmin before some threshold and also hold an AccessShareLock on some relation. GetLockConflicts() will provide the latter, at least.

Thanks for taking a look. For the purpose of B-tree page reuse, we don't directly care about the xmin of any active snapshot. We need only prove that no active scan is paused adjacent to the page, holding a right-link to it. We currently achieve that wait-free by first marking the page with the next available xid and then reusing it when that mark (btpo.xact) predates the oldest running xid (RecentXmin). (At the moment, I'm failing to work out why this is OK with scans from transactions that haven't allocated an xid, but I vaguely recall convincing myself it was fine at one point.)

It would indeed also be enough to call GetLockConflicts(locktag-of-index, AccessExclusiveLock) and check whether any of the returned transactions have PGPROC.xmin below the mark; see the sketch after this message. That's notably more expensive than just comparing RecentXmin, so I'm not sure how well it would pay off overall. However, it could only help us on the master. (Not strictly true, but any way I see to extend it to the standby has critical flaws.) On the master, we can see a conflicting transaction and put off reusing the page. By the time the record hits the standby, we have to apply it, and we might have a running transaction that will hold a lock on the index for the next, say, 72 hours. At such times, vacuum_defer_cleanup_age or hot_standby_feedback ought to prevent the recovery stall.

This did lead me to realize that what we do in this regard on the standby can be considerably independent from what we do on the master. If fruitful, the standby can prove the absence of a scan holding a right-link in a completely different fashion. So, we *could* take the cleanup-lock approach on the standby without changing very much on the master.

nm
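A sketch of that master-side check, assuming the 9.0-era GetLockConflicts() signature. VirtualXactGetRunningXmin() is a hypothetical lookup (reading the backend's PGPROC.xmin for a virtual xid) that does not exist today, and indexrel/opaque are assumed surrounding variables:

    /*
     * Sketch only.  Returns false if some current lock holder on the
     * index might still hold a right-link to the deleted page.
     * VirtualXactGetRunningXmin() is hypothetical.
     */
    LOCKTAG     tag;
    VirtualTransactionId *vxids;

    SET_LOCKTAG_RELATION(tag, MyDatabaseId, RelationGetRelid(indexrel));

    /* Everything conflicting with AccessExclusiveLock = every holder. */
    vxids = GetLockConflicts(&tag, AccessExclusiveLock);

    for (; VirtualTransactionIdIsValid(*vxids); vxids++)
    {
        TransactionId scan_xmin = VirtualXactGetRunningXmin(*vxids);

        /* A scan whose xmin precedes the mark may predate deletion. */
        if (TransactionIdIsValid(scan_xmin) &&
            TransactionIdPrecedes(scan_xmin, opaque->btpo.xact))
            return false;
    }
    return true;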
On Sat, Jun 11, 2011 at 11:40 PM, Noah Misch <noah@leadboat.com> wrote:
> We currently achieve that wait-free by first marking the page with the next available xid and then reusing it when that mark (btpo.xact) predates the oldest running xid (RecentXmin). [...]
>
> This did lead me to realize that what we do in this regard on the standby can be considerably independent from what we do on the master. If fruitful, the standby can prove the absence of a scan holding a right-link in a completely different fashion. So, we *could* take the cleanup-lock approach on the standby without changing very much on the master.

Well, I'm generally in favor of trying to fix this problem without changing what the master does. It's a weakness of our replication technology that the standby has no better way to cope with a cleanup operation on the master than to start killing queries, but then again it's a weakness of our MVCC technology that we don't reuse space quickly enough and end up with bloat. I hear a lot more complaints about the second weakness than I do about the first.

At any rate, if taking a cleanup lock on the right-linked page on the standby is sufficient to fix the problem, that seems like a far superior solution in any case. Presumably the frequency of someone having a pin on that particular page will be far lower than any matching based on XID or heavyweight locks. And the vast majority of such pins should disappear before the startup process feels obliged to get out its big hammer.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Jun 12, 2011 at 12:15:29AM -0400, Robert Haas wrote:
> Well, I'm generally in favor of trying to fix this problem without changing what the master does. It's a weakness of our replication technology that the standby has no better way to cope with a cleanup operation on the master than to start killing queries, but then again it's a weakness of our MVCC technology that we don't reuse space quickly enough and end up with bloat. I hear a lot more complaints about the second weakness than I do about the first.

I fully agree. That said, if this works on the standby, we may as well also use it opportunistically on the master, to throttle bloat.

> At any rate, if taking a cleanup lock on the right-linked page on the standby is sufficient to fix the problem, that seems like a far superior solution in any case. Presumably the frequency of someone having a pin on that particular page will be far lower than any matching based on XID or heavyweight locks. And the vast majority of such pins should disappear before the startup process feels obliged to get out its big hammer.

Yep; looks promising.

Does such a thing have a chance of being backpatchable? I think the chances start slim and fall almost to zero on account of the difficulty of avoiding a WAL format change. Assuming that conclusion, I do think it's worth starting with something simple, even if it means additional bloat on the master in the wal_level=hot_standby + vacuum_defer_cleanup_age / hot_standby_feedback case. In choosing those settings, the administrator has taken constructive steps to accept master-side bloat in exchange for delaying recovery conflict. What's your opinion?

Thanks,
nm
On Sun, Jun 12, 2011 at 3:01 PM, Noah Misch <noah@leadboat.com> wrote:
> I fully agree. That said, if this works on the standby, we may as well also use it opportunistically on the master, to throttle bloat.

As long as the performance cost is de minimis, I agree.

> Does such a thing have a chance of being backpatchable? I think the chances start slim and fall almost to zero on account of the difficulty of avoiding a WAL format change.

I can't see back-patching it. Maybe people would feel it was worth considering if we were getting legions of complaints about this problem, but so far you're the only one. In any case, back-patching a WAL format change is a complete non-starter -- we can't go making minor versions non-interoperable.

> Assuming that conclusion, I do think it's worth starting with something simple, even if it means additional bloat on the master in the wal_level=hot_standby + vacuum_defer_cleanup_age / hot_standby_feedback case. In choosing those settings, the administrator has taken constructive steps to accept master-side bloat in exchange for delaying recovery conflict. What's your opinion?

I'm pretty disinclined to go tinkering with 9.1 at this point, too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Jun 13, 2011 at 3:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Jun 12, 2011 at 3:01 PM, Noah Misch <noah@leadboat.com> wrote:
>> Assuming that conclusion, I do think it's worth starting with something simple, even if it means additional bloat on the master in the wal_level=hot_standby + vacuum_defer_cleanup_age / hot_standby_feedback case. In choosing those settings, the administrator has taken constructive steps to accept master-side bloat in exchange for delaying recovery conflict. What's your opinion?
>
> I'm pretty disinclined to go tinkering with 9.1 at this point, too.

Not least because a feature already exists in 9.1 to cope with this problem: hot standby feedback.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jun 9, 2011 at 10:38 PM, Noah Misch <noah@leadboat.com> wrote:
> Anyone care to comment? On this system, which has vacuum_defer_cleanup_age set to three peak hours' worth of xid consumption, the problem caps recovery-conflict holdoff at 10-20 minutes. It will have the same effect on standby feedback in 9.1. I think we should start by using RecentGlobalXmin instead of RecentXmin as the reuse horizon when wal_level = hot_standby, and backpatch that. Then, independently consider for master a bloat-avoidance improvement like the one I outlined most recently; I'm not sure whether that's worth it. In any event, I'm hoping to get some consensus on the way forward.

I like your ideas.

(Also, I note that using xids in this way unnecessarily keeps bloat around for a long time, if we have periods of mostly read-only activity. Interesting point.)

I think we would only get away with this approach on leaf pages of the index. It doesn't seem worth trying for the locks if we were higher up.

On the standby side, it's possible this could generate additional buffer pin deadlocks and/or contention. So I would also want to look at some deferral mechanism, so that we can mark the block removed, but not actually do so until some time later, or until we really need to, for example when we write new data to that page.

Got time for a patch in this coming CF?

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Jun 13, 2011 at 04:16:06PM +0100, Simon Riggs wrote:
> On Mon, Jun 13, 2011 at 3:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I'm pretty disinclined to go tinkering with 9.1 at this point, too.
>
> Not least because a feature already exists in 9.1 to cope with this problem: hot standby feedback.

A standby's receipt of an XLOG_BTREE_REUSE_PAGE record implies that the accompanying latestRemovedXid preceded or equaled the master's RecentXmin at the time of issue (see _bt_page_recyclable()). Neither hot_standby_feedback nor vacuum_defer_cleanup_age affects RecentXmin. Therefore, neither facility delays conflicts arising directly from B-tree page reuse. See the attached test script, which yields a snapshot conflict despite active hot_standby_feedback.

Thanks,
nm
Attachment
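To see why feedback cannot help here, a from-memory paraphrase of the two halves of the reuse path (the emission side in _bt_getbuf() and the redo side in nbtxlog.c); treat field and variable details as approximate:

    /*
     * Master, in _bt_getbuf(), when recycling a deleted page from the
     * FSM (paraphrase): the conflict horizon is simply the page's
     * btpo.xact, which _bt_page_recyclable() has already compared
     * against RecentXmin -- a horizon that neither hot_standby_feedback
     * nor vacuum_defer_cleanup_age moves.
     */
    if (XLogStandbyInfoActive())
    {
        xl_btree_reuse_page xlrec_reuse;

        xlrec_reuse.node = rel->rd_node;
        xlrec_reuse.block = blkno;
        xlrec_reuse.latestRemovedXid = opaque->btpo.xact;
        /* ... XLogInsert(RM_BTREE_ID, XLOG_BTREE_REUSE_PAGE, ...) ... */
    }

    /*
     * Standby, in btree_xlog_reuse_page() (paraphrase): the record
     * exists only as a conflict point; redo cancels any query whose
     * xmin is at or below the transmitted horizon.
     */
    if (InHotStandby)
        ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
                                            xlrec->node);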
On Mon, Jun 13, 2011 at 04:41:11PM +0100, Simon Riggs wrote:
> On Thu, Jun 9, 2011 at 10:38 PM, Noah Misch <noah@leadboat.com> wrote:
>> Anyone care to comment? [...] I think we should start by using RecentGlobalXmin instead of RecentXmin as the reuse horizon when wal_level = hot_standby, and backpatch that. Then, independently consider for master a bloat-avoidance improvement like the one I outlined most recently; I'm not sure whether that's worth it. In any event, I'm hoping to get some consensus on the way forward.
>
> I like your ideas.
>
> (Also, I note that using xids in this way unnecessarily keeps bloat around for a long time, if we have periods of mostly read-only activity. Interesting point.)

Thanks. Yes, the current approach can mean two long-running-transaction lifetimes before full space reclamation. Would be nice to avoid.

> I think we would only get away with this approach on leaf pages of the index. It doesn't seem worth trying for the locks if we were higher up.

Don't we only delete leaf pages? (And, one might say, pages that have become leaves by having all children deleted.)

> On the standby side, it's possible this could generate additional buffer pin deadlocks and/or contention. So I would also want to look at some deferral mechanism, so that we can mark the block removed, but not actually do so until some time later, or until we really need to, for example when we write new data to that page.

That could be handy. I do wonder what high pin-hold durations arise regularly in the field. Preventing standby buffer pin deadlocks would be a nice win in any case.

> Got time for a patch in this coming CF?

Not for the June CF, unfortunately. Making a suitable test suite will be a large portion of the work. The logic in question closes race conditions that probably only arise under the heaviest field concurrency, so I'll need to systematically visit every concurrency variation to have confidence that it's correct.

nm
On Tue, Jun 14, 2011 at 5:28 AM, Noah Misch <noah@leadboat.com> wrote:
> A standby's receipt of an XLOG_BTREE_REUSE_PAGE record implies that the accompanying latestRemovedXid preceded or equaled the master's RecentXmin at the time of issue (see _bt_page_recyclable()). Neither hot_standby_feedback nor vacuum_defer_cleanup_age affects RecentXmin. Therefore, neither facility delays conflicts arising directly from B-tree page reuse. See the attached test script, which yields a snapshot conflict despite active hot_standby_feedback.

OK, agreed. Bug. Good catch, Noah.

Fix is to use RecentGlobalXmin for the cutoff when in Hot Standby mode, so that it is under user control.

Attached patch will be applied to head and backpatched to 9.1 and 9.0 to fix this. No effect on non-users of Hot Standby. Minimally invasive for HS users.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
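One possible shape of the cutoff change Simon describes, sketched here only as a reading aid. The authoritative version is the attached patch; whether the committed code gates on XLogStandbyInfoActive() or switches horizons unconditionally is not shown in this thread:

    /*
     * Sketch of the described fix inside _bt_page_recyclable(); not the
     * attached patch itself.  With standby info in WAL, honor the
     * horizon that vacuum_defer_cleanup_age / hot_standby_feedback push
     * back, putting the reuse point under user control.
     */
    TransactionId cutoff;

    cutoff = XLogStandbyInfoActive() ? RecentGlobalXmin : RecentXmin;

    if (P_ISDELETED(opaque) &&
        TransactionIdPrecedesOrEquals(opaque->btpo.xact, cutoff))
        return true;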
On Thu, Jun 16, 2011 at 12:02:47AM +0100, Simon Riggs wrote:
> OK, agreed. Bug. Good catch, Noah.
>
> Fix is to use RecentGlobalXmin for the cutoff when in Hot Standby mode, so that it is under user control.
>
> Attached patch will be applied to head and backpatched to 9.1 and 9.0 to fix this.

Thanks. We still hit a conflict when btpo.xact == RecentGlobalXmin and the standby has a transaction older than any master transaction. This happens because the tests at nbtpage.c:704 and procarray.c:1843 both pass when the xid is exactly that of the oldest standby transaction (line numbers as of git cb94db91b). I only know this because the test script from my last message hits this case; it might never get hit in real usage. Still, it seems like a hole not worth leaving. I think the most-correct fix is to TransactionIdRetreat the btpo.xact before using it as xl_btree_reuse_page.latestRemovedXid; see the sketch below. btpo.xact is the first known-safe xid, but latestRemovedXid is the last known-unsafe xmin.

nm
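A sketch of that off-by-one fix at the point where _bt_getbuf() fills in the reuse record (field names from the 9.0-era xl_btree_reuse_page; exact placement assumed):

    /* Sketch of the suggested fix, in the reuse-page WAL path. */
    TransactionId latestRemovedXid = opaque->btpo.xact;

    /*
     * btpo.xact is the first xid known SAFE, while latestRemovedXid must
     * name the last xid known UNSAFE, so step back by one.  Note that
     * TransactionIdRetreat() is a macro that modifies its argument in
     * place (wrapping around the permanent xids) and returns nothing.
     */
    TransactionIdRetreat(latestRemovedXid);

    xlrec_reuse.latestRemovedXid = latestRemovedXid;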
On Thu, Jun 16, 2011 at 3:47 PM, Noah Misch <noah@leadboat.com> wrote:
> Thanks. We still hit a conflict when btpo.xact == RecentGlobalXmin and the standby has a transaction older than any master transaction. This happens because the tests at nbtpage.c:704 and procarray.c:1843 both pass when the xid is exactly that of the oldest standby transaction (line numbers as of git cb94db91b). I only know this because the test script from my last message hits this case; it might never get hit in real usage. Still, it seems like a hole not worth leaving. I think the most-correct fix is to TransactionIdRetreat the btpo.xact before using it as xl_btree_reuse_page.latestRemovedXid. btpo.xact is the first known-safe xid, but latestRemovedXid is the last known-unsafe xmin.

I think you are pointing out another bug, rather than a problem in my last commit.

The bug was caused by assuming that the xid is a "latestRemovedXid", as is the case in the rest of Hot Standby, which masks the off-by-one error through poor use of terms.

I agree with your suggested fix. Thanks again.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On Thu, Jun 16, 2011 at 4:53 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> I agree with your suggested fix.

Please ignore the previous patch, which was sent in error. Here's the fix. I'll apply this tomorrow morning if we all still agree.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On Thu, Jun 16, 2011 at 04:53:41PM +0100, Simon Riggs wrote:
> On Thu, Jun 16, 2011 at 3:47 PM, Noah Misch <noah@leadboat.com> wrote:
>> Thanks. We still hit a conflict when btpo.xact == RecentGlobalXmin and the standby has a transaction older than any master transaction. [...] I think the most-correct fix is to TransactionIdRetreat the btpo.xact before using it as xl_btree_reuse_page.latestRemovedXid. btpo.xact is the first known-safe xid, but latestRemovedXid is the last known-unsafe xmin.
>
> I think you are pointing out another bug, rather than a problem in my last commit.
>
> The bug was caused by assuming that the xid is a "latestRemovedXid", as is the case in the rest of Hot Standby, which masks the off-by-one error through poor use of terms.

Exactly.

> I agree with your suggested fix.

(Note that TransactionIdRetreat mutates its argument and returns nothing.)

Thanks,
nm