Thread: advance local xmin more aggressively

advance local xmin more aggressively

From
Jeff Davis
Date:
With the new snapshot maintenance code, it looks like we can advance the
xmin more aggressively.

For instance:

S1:
INSERT INTO foo VALUES(1);

S2:
BEGIN;
DECLARE c1 CURSOR FOR SELECT i FROM foo;

S1:
DELETE FROM foo;

S2:
DECLARE c2 CURSOR FOR SELECT i FROM foo;
CLOSE c1;

S1:
VACUUM VERBOSE foo;

The VACUUM should be able to clean up the deleted tuple, because it's no
longer visible to anyone.

Attached a small patch to accomplish this. I don't expect it to be put
in 8.4, but it's small enough that I thought I should at least send it
in just in case.

Regards,
    Jeff Davis

Attachment

Re: advance local xmin more aggressively

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> With the new snapshot maintenance code, it looks like we can advance the
> xmin more aggressively.

The original design for that contemplated having snapmgr.c track
all the snapshots (cf the comment for RegisteredSnapshots).  I don't
particularly care for having it assume that it can find all the resource
owners.

But really the more important problem is to demonstrate that you
actually get a benefit commensurate with the additional cycles spent.
IIRC the reason the code is the way it is is that we concluded that for
typical usage patterns there wouldn't be any win from tracking things
more aggressively.  As somebody pointed out recently, SnapshotResetXmin
is called quite a lot; if it's expensive it's going to be a problem.
        regards, tom lane


Re: advance local xmin more aggressively

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > With the new snapshot maintenance code, it looks like we can advance the
> > xmin more aggressively.
> 
> The original design for that contemplated having snapmgr.c track
> all the snapshots (cf the comment for RegisteredSnapshots).  I don't
> particularly care for having it assume that it can find all the resource
> owners.
> 
> But really the more important problem is to demonstrate that you
> actually get a benefit commensurate with the additional cycles spent.
> IIRC the reason the code is the way it is is that we concluded that for
> typical usage patterns there wouldn't be any win from tracking things
> more aggressively.  As somebody pointed out recently, SnapshotResetXmin
> is called quite a lot; if it's expensive it's going to be a problem.

I think Jeff is coming from the Truviso point of view: they have very
long running transactions, and they normally have a number of snapshots
that are always being updated, but it's rare that there are no snapshots
at all.  So this optimization actually buys a chance to update Xmin at
all; with the current code, they keep the same xmin all the time because
there's always some snapshot.

I'm not sure if the best answer is to just state that Truviso should
keep maintaining this patch privately.  It would be, of course, much
better to come up with a way to keep track of this in a cheaper way.

For example, maybe we could keep track of counts of snapshots removed
since the last xmin calculation, and only run this routine if the number
is different from zero (or some small positive integer).


Re: advance local xmin more aggressively

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> For example, maybe we could keep track of counts of snapshots removed
> since the last xmin calculation, and only run this routine if the number
> is different from zero (or some small positive integer).

I think most of the callers of SnapshotResetXmin already know they
removed something.

It might be interesting for FreeSnapshot or something nearby to note
whether the snapshot being killed has xmin = proc's xmin, and only do
the update calculation if so.

I still dislike the assumption that all resource owners are children of
a known owner.  I suspect in fact that it's demonstrably wrong right
now, let alone in future (cf comments in PortalRun).  If we're going to
do this then snapmgr.c needs to track the snapshots for itself.  Of
course that's going to make the "is it worth it" question even more
pressing.
        regards, tom lane


Re: advance local xmin more aggressively

From
Robert Haas
Date:
On Tue, Feb 10, 2009 at 3:06 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> With the new snapshot maintenance code, it looks like we can advance the
> xmin more aggressively.

Can you clarify the circumstances in which this patch would show a
benefit over the current system?

...Robert


Re: advance local xmin more aggressively

From
Jeff Davis
Date:
On Wed, 2009-02-11 at 10:20 -0500, Robert Haas wrote:
> On Tue, Feb 10, 2009 at 3:06 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> > With the new snapshot maintenance code, it looks like we can advance the
> > xmin more aggressively.
> 
> Can you clarify the circumstances in which this patch would show a
> benefit over the current system?

In the current code, if the process is always holding at least one
snapshot, the process' xmin never advances. That means VACUUM will never
be able to reclaim tuples visible during the first snapshot taken during
the transaction.

With the patch, as long as snapshots are being released, the process'
xmin will keep advancing to reflect the oldest snapshot currently held
by that process.

In order to accomplish that, every time a snapshot is released I have to
look at every snapshot that the process still holds to find the new
local minimum xmin. The current code will only change the process' xmin
if there are no snapshots at all.

As Tom pointed out, one of the assumptions I made writing the patch is
not always true. I am still trying to determine the implications of
that.

Regards,Jeff Davis



Re: advance local xmin more aggressively

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> On Wed, 2009-02-11 at 10:20 -0500, Robert Haas wrote:
>> Can you clarify the circumstances in which this patch would show a
>> benefit over the current system?

> In the current code, if the process is always holding at least one
> snapshot, the process' xmin never advances.

Right, and the question is the scope of the circumstances in which
that's the case and your patch makes things better.  I believe that

* a process outside a transaction has no snapshots, so your patch
won't change anything

* a process in a serializable transaction will hold the serializable
snapshot till end of xact, so your patch won't change anything

* a process in a read-committed transaction will typically hold
snapshot(s) for the duration of each query, and then release them
all, so your patch won't change anything

You pointed out the case of opening a cursor in a read-committed
transaction, and then closing it before end of transaction, as a
place where your patch could hope to improve matters.  Are there
others?  Could your application be made to close that cursor before
opening another one (so that its set of open snapshots momentarily
goes to zero)?  It seems like the use case for more complex
bookkeeping for open snapshots is a tad narrow.
        regards, tom lane


Re: advance local xmin more aggressively

From
Jeff Davis
Date:
On Wed, 2009-02-11 at 12:20 -0500, Tom Lane wrote: 
> You pointed out the case of opening a cursor in a read-committed
> transaction, and then closing it before end of transaction, as a
> place where your patch could hope to improve matters.  Are there
> others?  Could your application be made to close that cursor before
> opening another one (so that its set of open snapshots momentarily
> goes to zero)?  It seems like the use case for more complex
> bookkeeping for open snapshots is a tad narrow.
> 

I'm approaching this from the perspective of our system at Truviso. I
used the cursor example to illustrate the issue in normal postgresql,
but our problem isn't directly related to cursors.

I don't have a compelling use case right now for normal postgresql,
because of the reasons you mention. However, at Truviso, we have to come
up with some kind of solution to this anyway. Ideally, I'd like to make
something that's acceptable to postgres in general (meaning no
performance impact or code ugliness).

I could imagine some situations that this could be more useful in the
future. For instance, Hot Standby will increase the consequences of not
advancing xmin when it's possible to do so.

Regards,Jeff Davis



Re: advance local xmin more aggressively

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> I could imagine some situations that this could be more useful in the
> future. For instance, Hot Standby will increase the consequences of not
> advancing xmin when it's possible to do so.

The question wasn't really about the consequences; it was about whether
there was any hope of this patch being able to advance xmin more than
the code that's there, for common usage patterns.
        regards, tom lane


Re: advance local xmin more aggressively

From
Robert Haas
Date:
[ reviving a very old thread ]

On Tue, Feb 10, 2009 at 4:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> For example, maybe we could keep track of counts of snapshots removed
>> since the last xmin calculation, and only run this routine if the number
>> is different from zero (or some small positive integer).
>
> I think most of the callers of SnapshotResetXmin already know they
> removed something.
>
> It might be interesting for FreeSnapshot or something nearby to note
> whether the snapshot being killed has xmin = proc's xmin, and only do
> the update calculation if so.
>
> I still dislike the assumption that all resource owners are children of
> a known owner.  I suspect in fact that it's demonstrably wrong right
> now, let alone in future (cf comments in PortalRun).  If we're going to
> do this then snapmgr.c needs to track the snapshots for itself.  Of
> course that's going to make the "is it worth it" question even more
> pressing.

I've run into essentially the same problem Jeff originally complained
about with a large customer who has long-running transactions that
make extensive use of cursors.  Cursors are opened and closed over
time but it is rare for the number open to reach exactly zero, so what
ends up happening is that the backend xmin does not advance.  As you
can imagine, that doesn't work out well.

The approach I came up with initially was similar to Jeff's: start at
the topmost resource owner and traverse them all, visiting every
snapshot along the way.  But then I found this thread and saw the
comment that this might be "demonstrably wrong" and referring to the
comments in PortalRun.  Having reviewed those comments, which don't
seem to have changed much in the last five years, I can't understand
how they related to this issue.  It's true that the TopTransaction
resource owner could get swapped out under us during an internal
commit, but why should SnapshotResetXmin() have to care? It just
traverses the one that is in effect at the time it gets called.  The
only real danger I see here is that there could be *more than one*
toplevel resource owner.  I wonder if we could solve that problem by
adding a registry of active toplevel resource owners, so that if we
have a forest rather than a tree we can still find everything.

The problem I see with having snapmgr.c track the snapshots for itself
is that it is mostly duplicating of bookkeeping which is already being
done.  Since this problem doesn't affect the majority of users, it's
not desirable to add a lot of extra bookkeeping to cater to it - but
even if it did affect a lot of users, we still want it to be as cheap
as possible, and reusing the tracking that resource owners are already
doing seems like the way to get there.

I think there are a couple of things we can do to make this cheaper.
Suppose we keep track of the oldest xmin of any registered snapshot
and the number of registered snapshots that have that particular xmin.
Every time we deregister a snapshot, we check whether it is one of the
ones with the minimum xmin; if it is, we decrement the count.  When
the count reaches 0, we know that a traversal of all registered
snapshots is guaranteed to find a newer value to advertise in
MyProc->xmin; that way, we can arrange to do the work only when it's
likely to pay off.  In most cases this won't happen until the last
snapshot is unregistered, because our snapshots normally form a stack,
with newer snapshots having been taken later.  But if somebody
unregisters the oldest snapshot we'll immediately notice and
recalculate.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: advance local xmin more aggressively

From
Heikki Linnakangas
Date:
On 12/05/2014 05:05 PM, Robert Haas wrote:
> [ reviving a very old thread ]
>
> On Tue, Feb 10, 2009 at 4:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Alvaro Herrera <alvherre@commandprompt.com> writes:
>>> For example, maybe we could keep track of counts of snapshots removed
>>> since the last xmin calculation, and only run this routine if the number
>>> is different from zero (or some small positive integer).
>>
>> I think most of the callers of SnapshotResetXmin already know they
>> removed something.
>>
>> It might be interesting for FreeSnapshot or something nearby to note
>> whether the snapshot being killed has xmin = proc's xmin, and only do
>> the update calculation if so.
>>
>> I still dislike the assumption that all resource owners are children of
>> a known owner.  I suspect in fact that it's demonstrably wrong right
>> now, let alone in future (cf comments in PortalRun).  If we're going to
>> do this then snapmgr.c needs to track the snapshots for itself.  Of
>> course that's going to make the "is it worth it" question even more
>> pressing.
>
> I've run into essentially the same problem Jeff originally complained
> about with a large customer who has long-running transactions that
> make extensive use of cursors.  Cursors are opened and closed over
> time but it is rare for the number open to reach exactly zero, so what
> ends up happening is that the backend xmin does not advance.  As you
> can imagine, that doesn't work out well.
>
> The approach I came up with initially was similar to Jeff's: start at
> the topmost resource owner and traverse them all, visiting every
> snapshot along the way.  But then I found this thread and saw the
> comment that this might be "demonstrably wrong" and referring to the
> comments in PortalRun.  Having reviewed those comments, which don't
> seem to have changed much in the last five years, I can't understand
> how they related to this issue.  It's true that the TopTransaction
> resource owner could get swapped out under us during an internal
> commit, but why should SnapshotResetXmin() have to care? It just
> traverses the one that is in effect at the time it gets called.  The
> only real danger I see here is that there could be *more than one*
> toplevel resource owner.  I wonder if we could solve that problem by
> adding a registry of active toplevel resource owners, so that if we
> have a forest rather than a tree we can still find everything.

I don't immediately see the problem either, but I have to say that 
grovelling through all the resource owners seems ugly anyway. Resource 
owners are not meant to be traversed like that. And there could be a lot 
of them, and you have to visit every one of them. That could get 
expensive if there are a lot of resource owners.

BTW, you could easily detect that you haven't seen all the registered 
snapshots, after traversing the resource owner, as we keep the counter 
of them. So you could just fall back to not advancing the xmin if it 
happens.

> The problem I see with having snapmgr.c track the snapshots for itself
> is that it is mostly duplicating of bookkeeping which is already being
> done.  Since this problem doesn't affect the majority of users, it's
> not desirable to add a lot of extra bookkeeping to cater to it - but
> even if it did affect a lot of users, we still want it to be as cheap
> as possible, and reusing the tracking that resource owners are already
> doing seems like the way to get there.

I would prefer doing separate bookkeeping in snapmgr.c. It seems like it 
wouldn't be difficult to do. It has to be cheap, but I don't see any 
reason to believe that it'd be more expensive than traversing through 
all resource owners. A binary heap or some other priority queue 
implementation should work pretty well for this.

> I think there are a couple of things we can do to make this cheaper.
> Suppose we keep track of the oldest xmin of any registered snapshot
> and the number of registered snapshots that have that particular xmin.
> Every time we deregister a snapshot, we check whether it is one of the
> ones with the minimum xmin; if it is, we decrement the count.  When
> the count reaches 0, we know that a traversal of all registered
> snapshots is guaranteed to find a newer value to advertise in
> MyProc->xmin; that way, we can arrange to do the work only when it's
> likely to pay off.  In most cases this won't happen until the last
> snapshot is unregistered, because our snapshots normally form a stack,
> with newer snapshots having been taken later.  But if somebody
> unregisters the oldest snapshot we'll immediately notice and
> recalculate.

Yeah, that's a reasonable optimization. It's a reasonable optimization 
even if you do the bookkeeping in snapmgr.c.

And that optimization won't save you in the cases where it doesn't 
apply. For example, what if snapshots happen to form a queue, rather 
than a stack:

DECLARE c1 CURSOR FOR ...;
DECLARE c2 CURSOR FOR ...;
DECLARE c3 CURSOR FOR ...;
...
DECLARE c1000 CURSOR FOR ...;

CLOSE c1;
CLOSE c2;
CLOSE c3;
...
CLOSE c1000;

It's not hard to imagine an application doing that. Sure, you could 
rewrite it to close the cursors in different order, but on the face of 
it that's not an unreasonable thing for an application to do. I think we 
should avoid the O(n^2) behaviour in that case.

(But +1 for addressing this problem, with the extra bookkeeping in 
snapmgr.c)

- Heikki



Re: advance local xmin more aggressively

From
Robert Haas
Date:
On Mon, Dec 8, 2014 at 4:56 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> I don't immediately see the problem either, but I have to say that
> grovelling through all the resource owners seems ugly anyway. Resource
> owners are not meant to be traversed like that. And there could be a lot of
> them, and you have to visit every one of them. That could get expensive if
> there are a lot of resource owners.

1. I don't really see why resource owners shouldn't be traversed like
that.  They are clearly intended to form a hierarchy, and there's
existing code that recurses through the hierarchy from a given level
downward.  What's ugly about that?

2. If you have a lot of resource owners, you probably have a lot of
snapshots, so walking a list will be expensive, too.  It will be
disproportionately expensive to walk the resource owner tree only if
there are lots of resource owners but very few of them have any
snapshots.  But I don't think that can really happen.  If you've got
lots of resource owners and each one has a snapshot, you'll traverse
~3 pointers per snapshot: ~1 to find the next ResourceOwner, 1 to find
the snapshot array, and 1 to reach the snapshot itself.  A non-inlined
list would traverse only 2 pointers per snapshot, but that doesn't
seem like enough of a difference to get excited about.

> BTW, you could easily detect that you haven't seen all the registered
> snapshots, after traversing the resource owner, as we keep the counter of
> them. So you could just fall back to not advancing the xmin if it happens.

Not a bad idea.  Or we could elog(FATAL) or fail an assertion if we
don't see them all, and then if it happens we call it a bug and fix
it.

> I would prefer doing separate bookkeeping in snapmgr.c. It seems like it
> wouldn't be difficult to do. It has to be cheap, but I don't see any reason
> to believe that it'd be more expensive than traversing through all resource
> owners. A binary heap or some other priority queue implementation should
> work pretty well for this.

That's optimizing for making the xmin recomputation cheap, but we
don't expect xmin recomputation to happen very often, so I'm not sure
that's the right trade-off.

> And that optimization won't save you in the cases where it doesn't apply.
> For example, what if snapshots happen to form a queue, rather than a stack:
>
> DECLARE c1 CURSOR FOR ...;
> DECLARE c2 CURSOR FOR ...;
> DECLARE c3 CURSOR FOR ...;
> ...
> DECLARE c1000 CURSOR FOR ...;
>
> CLOSE c1;
> CLOSE c2;
> CLOSE c3;
> ...
> CLOSE c1000;
>
> It's not hard to imagine an application doing that. Sure, you could rewrite
> it to close the cursors in different order, but on the face of it that's not
> an unreasonable thing for an application to do. I think we should avoid the
> O(n^2) behaviour in that case.

You won't actually get O(n^2) behavior here unless those DECLARE
CURSOR statements all get snapshots with different xmins; if many of
them have snapshots that share an xmin, then the optimization of
recomputing only when there are no snaps with a given xmin will save
you.  That's a bit pathological, but maybe we should try to cater to
it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: advance local xmin more aggressively

From
Robert Haas
Date:
On Mon, Dec 8, 2014 at 9:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Dec 8, 2014 at 4:56 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> I don't immediately see the problem either, but I have to say that
>> grovelling through all the resource owners seems ugly anyway. Resource
>> owners are not meant to be traversed like that. And there could be a lot of
>> them, and you have to visit every one of them. That could get expensive if
>> there are a lot of resource owners.
>
> 1. I don't really see why resource owners shouldn't be traversed like
> that.  They are clearly intended to form a hierarchy, and there's
> existing code that recurses through the hierarchy from a given level
> downward.  What's ugly about that?

Here's a patch.  I looked at the issue of tracking parent-less
resource owners a bit more closely.  It turns out that resource owners
are always created in TopMemoryContext "since they should only be
freed explicitly" (cf. resowner.c).  I was a bit worried about that,
because it would be bad to keep a list of all parent-less resource
owners if list elements could vanish in a context reset.  But that
doesn't seem to be a concern.  So I stole the nextchild links of
parent-less resource owners, which are not used for anything
currently, to keep a list of such resource owners.

It occurred to me that there's probably not much value in recomputing
xmin when the active snapshot stack is non-empty.  It's not impossible
that a PL/pgsql function could close a cursor with an old xmin and
then do lots of other work (or just sleep for a long time) before
returning to the top-level, but it is pretty unlikely.  So the
attached patch only considers recomputing the advertised xmin when the
active snapshot stack is empty.  That'll happen at the end of each
statement, which seems soon enough.

Upthread, I suggested keeping a tally of the number of snapshots with
the advertised xmin and recomputing the xmin to advertise only when it
reaches 0.  This patch doesn't implementation that optimization, but
it does have code that aborts the traversal of the resource owner
hierarchy as soon as we see an xmin that will preclude advancing our
advertised xmin.  Releasing N resource owners could therefore cost
O(N^2) in the worst case, but note that releasing N resource owners is
*already* an O(N^2) operation in the worst case, because the list of
children of a particular parent resource owner is singly linked, and
thus deleting a resource owner is O(N).  It's been that way for an
awfully long time without anyone complaining, probably because (a)
it's not particularly common to have large numbers of cursors open
simultaneously and (b) even if you do have that, the constant factor
is pretty low.

Following Heikki's previous suggestion, the patch contains checks to
make sure that we find exactly the number of registered snapshots that
we expect to find.  We could consider demoting these to asserts or
something, but this is more likely to catch bugs if there are any.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: advance local xmin more aggressively

From
Heikki Linnakangas
Date:
On 12/09/2014 10:35 PM, Robert Haas wrote:
> On Mon, Dec 8, 2014 at 9:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Mon, Dec 8, 2014 at 4:56 AM, Heikki Linnakangas
>> <hlinnakangas@vmware.com> wrote:
>>> I don't immediately see the problem either, but I have to say that
>>> grovelling through all the resource owners seems ugly anyway. Resource
>>> owners are not meant to be traversed like that. And there could be a lot of
>>> them, and you have to visit every one of them. That could get expensive if
>>> there are a lot of resource owners.
>>
>> 1. I don't really see why resource owners shouldn't be traversed like
>> that.  They are clearly intended to form a hierarchy, and there's
>> existing code that recurses through the hierarchy from a given level
>> downward.  What's ugly about that?

I can't exactly point my finger on it, but it just feels wrong from a
modularity point of view. Their purpose is to make sure that we don't
leak resources on abort, by allowing easy an "release everything"
operation. It's not designed for finding objects based on some other
criteria.

There is similar double bookkeeping of many other things that are
tracked by resource owners. Heavy-weight locks are tracked by LOCALLOCK
structs, buffer pins in PrivateRefCount array etc. Those things existed
before resource owners were invented, but if we were starting from
scratch, that design would still make sense, as different objects have
different access criteria. fd.c needs to be able to find the
least-recently-used open file, for example, and you need to find the
snapshot with the lowest xmin.

> Upthread, I suggested keeping a tally of the number of snapshots with
> the advertised xmin and recomputing the xmin to advertise only when it
> reaches 0.  This patch doesn't implementation that optimization, but
> it does have code that aborts the traversal of the resource owner
> hierarchy as soon as we see an xmin that will preclude advancing our
> advertised xmin.  Releasing N resource owners could therefore cost
> O(N^2) in the worst case, but note that releasing N resource owners is
> *already* an O(N^2) operation in the worst case, because the list of
> children of a particular parent resource owner is singly linked, and
> thus deleting a resource owner is O(N).  It's been that way for an
> awfully long time without anyone complaining, probably because (a)
> it's not particularly common to have large numbers of cursors open
> simultaneously and (b) even if you do have that, the constant factor
> is pretty low.

I think you're confusing the N and the N above. It's true that deleting
a resource owner is O(M), where the M is the number of children of that
resource owner. It does not follow that releasing N resource owners is
O(N^2), where N is the number of resource owners released. Calling
ResourceOwnerDelete(x) will only visit each resource owner in that tree
once, so it's O(N), where N is the total number of resource owners in
the tree.

I did some testing of the worst case scenario. The attached script first
creates a lot of cursors, then a lot of savepoints, and finally closes
the cursors in FIFO order. When the number of savepoints is high enough,
this actually segfaults with your patch, because you run out of stack
space when recursing the subxact resource owners. That's hardly this
patch's fault, I'm actually surprised it doesn't crash without it,
because we recurse into all resource owners in ResourceOwnerRelease too.
Apparently the subxacts are closed in LIFO order at commit, but there
might be are other cases where you could trigger that. In any case, a
stack-depth check would be nice.

- Heikki


Attachment

Re: advance local xmin more aggressively

From
Robert Haas
Date:
On Wed, Dec 10, 2014 at 7:34 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
>>> 1. I don't really see why resource owners shouldn't be traversed like
>>> that.  They are clearly intended to form a hierarchy, and there's
>>> existing code that recurses through the hierarchy from a given level
>>> downward.  What's ugly about that?
>
> I can't exactly point my finger on it, but it just feels wrong from a
> modularity point of view. Their purpose is to make sure that we don't leak
> resources on abort, by allowing easy an "release everything" operation. It's
> not designed for finding objects based on some other criteria.
>
> There is similar double bookkeeping of many other things that are tracked by
> resource owners. Heavy-weight locks are tracked by LOCALLOCK structs, buffer
> pins in PrivateRefCount array etc. Those things existed before resource
> owners were invented, but if we were starting from scratch, that design
> would still make sense, as different objects have different access criteria.
> fd.c needs to be able to find the least-recently-used open file, for
> example, and you need to find the snapshot with the lowest xmin.

I think all of these are performance trade-offs rather than questions
of style.  LOCALLOCK structs and private buffer pin tracking existed
before resource owners because we need to do lookups of locks by lock
tag or buffers by buffer number frequently enough that, if we had to
grovel trough all the locks or buffer pins in the system without any
indexing, it would be slow.  We *could* do it that way now that we
have resource owners, but it's pretty obvious that it would suck.

That's less obvious here.  If we throw all of the registered snapshots
into a binary heap, finding the lowest xmin will be cheaper every time
we do it, but we'll have the overhead of maintaining that binary heap
even if it never gets used.

> I think you're confusing the N and the N above. It's true that deleting a
> resource owner is O(M), where the M is the number of children of that
> resource owner. It does not follow that releasing N resource owners is
> O(N^2), where N is the number of resource owners released. Calling
> ResourceOwnerDelete(x) will only visit each resource owner in that tree
> once, so it's O(N), where N is the total number of resource owners in the
> tree.

Not if the portals are dropped in separate commands with an
unpredictable ordering.  Then you have N separate calls to
ResourceOwnerDelete.

> I did some testing of the worst case scenario. The attached script first
> creates a lot of cursors, then a lot of savepoints, and finally closes the
> cursors in FIFO order. When the number of savepoints is high enough, this
> actually segfaults with your patch, because you run out of stack space when
> recursing the subxact resource owners. That's hardly this patch's fault, I'm
> actually surprised it doesn't crash without it, because we recurse into all
> resource owners in ResourceOwnerRelease too. Apparently the subxacts are
> closed in LIFO order at commit, but there might be are other cases where you
> could trigger that. In any case, a stack-depth check would be nice.

Thanks for the test case; that's helpful.

I added a stack depth check, but it doesn't help much:

ERROR:  stack depth limit exceeded
HINT:  Increase the configuration parameter "max_stack_depth"
(currently 2048kB), after ensuring the platform's stack depth limit is
adequate.
STATEMENT:  CLOSE cur0;
ERROR:  stack depth limit exceeded
HINT:  Increase the configuration parameter "max_stack_depth"
(currently 2048kB), after ensuring the platform's stack depth limit is
adequate.
WARNING:  AbortSubTransaction while in ABORT state
ERROR:  stack depth limit exceeded
HINT:  Increase the configuration parameter "max_stack_depth"
(currently 2048kB), after ensuring the platform's stack depth limit is
adequate.
WARNING:  AbortSubTransaction while in ABORT state
ERROR:  stack depth limit exceeded
HINT:  Increase the configuration parameter "max_stack_depth"
(currently 2048kB), after ensuring the platform's stack depth limit is
adequate.
WARNING:  AbortSubTransaction while in ABORT state
ERROR:  stack depth limit exceeded
HINT:  Increase the configuration parameter "max_stack_depth"
(currently 2048kB), after ensuring the platform's stack depth limit is
adequate.
PANIC:  ERRORDATA_STACK_SIZE exceeded

That's at 100k subtransactions.  I took some performance data for
lower numbers of subtransactions:

-- at 1000 subtransactions --
master 0.156s 0.157s 0.154s
advance-xmin 0.177s 0.175s 0.177s

-- at 10000 subtransactions --
master 0.458s 0.457s 0.456s
advance-xmin 0.639s 0.637s 0.633

I guess this bears some further thought.  I certainly don't like the
fact that this makes the whole system crap out at a lower number of
subtransactions than presently.  The actual performance numbers don't
bother me very much; I'm comfortable with the possibility that closing
a cursor will be some modest percentage slower if you've got thousands
of active savepoints.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: advance local xmin more aggressively

From
Robert Haas
Date:
On Wed, Dec 10, 2014 at 9:49 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I guess this bears some further thought.  I certainly don't like the
> fact that this makes the whole system crap out at a lower number of
> subtransactions than presently.  The actual performance numbers don't
> bother me very much; I'm comfortable with the possibility that closing
> a cursor will be some modest percentage slower if you've got thousands
> of active savepoints.

Here's a new version with two changes:

1. I changed the traversal of the resource owner tree to iterate
instead of recurse.  It now does a depth-first, pre-order traversal of
the tree; when we reach the last child of a node, we follow its parent
pointer to get back to where we were.  That way, we don't need to keep
anything on the stack.  That fixed the crash at 100k cursors, but it
was still 4x slower.

2. Instead of traversing the tree until we find an xmin equal to the
one we're currently advertising, the code now traverses the entire
tree each time it runs. However, it also keeps a record of how many
times the oldest xmin occurred in the tree, which is decremented each
time we unregister a snapshot with that xmin; the traversal doesn't
run again until that count reaches 0.  That fixed the performance
regression on your test case.

With a million subtransactions:

master 34.464s 33.742s 34.317s
advance-xmin 34.516s 34.069s 34.196s

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: advance local xmin more aggressively

From
Heikki Linnakangas
Date:
On 12/10/2014 06:56 PM, Robert Haas wrote:
> On Wed, Dec 10, 2014 at 9:49 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I guess this bears some further thought.  I certainly don't like the
>> fact that this makes the whole system crap out at a lower number of
>> subtransactions than presently.  The actual performance numbers don't
>> bother me very much; I'm comfortable with the possibility that closing
>> a cursor will be some modest percentage slower if you've got thousands
>> of active savepoints.
>
> Here's a new version with two changes:
>
> 1. I changed the traversal of the resource owner tree to iterate
> instead of recurse.  It now does a depth-first, pre-order traversal of
> the tree; when we reach the last child of a node, we follow its parent
> pointer to get back to where we were.  That way, we don't need to keep
> anything on the stack.  That fixed the crash at 100k cursors, but it
> was still 4x slower.

Clever. Could we use that method in ResourceOwnerReleaseInternal and 
ResourceOwnerDelete, too? Might be best to have a 
ResourceOwnerWalk(resowner, callback) function for walking all resource 
owners in a tree, instead of one for walking the snapshots in them.

> 2. Instead of traversing the tree until we find an xmin equal to the
> one we're currently advertising, the code now traverses the entire
> tree each time it runs. However, it also keeps a record of how many
> times the oldest xmin occurred in the tree, which is decremented each
> time we unregister a snapshot with that xmin; the traversal doesn't
> run again until that count reaches 0.  That fixed the performance
> regression on your test case.
>
> With a million subtransactions:
>
> master 34.464s 33.742s 34.317s
> advance-xmin 34.516s 34.069s 34.196s

Well, you can still get the slowness back by running other stuff in the 
background. I admit that it's a very obscure case, probably fine in 
practice. I would still feel better if snapmgr.c did its own 
bookkeeping, from an aesthetic point of view. In a heap, or even just a 
linked list, if the performance characteristics of that is acceptable.

It occurs to me that the pairing heap I just posted in another thread 
(http://www.postgresql.org/message-id/54886BB8.9040000@vmware.com) would 
be a good fit for this. It's extremely cheap to insert to and to find 
the minimum item (O(1)), and the delete operation is O(log N), 
amortized. I didn't implement a delete operation, for removing a 
particular node, I only did delete-min, but it's basically the same 
code. Using a pairing heap for this might be overkill, but if we have 
that implementation anyway, the code in snapmgr.c to use it would be 
very simple, so I see little reason not to. It might even be simpler 
than your patch, because you wouldn't need to have the heuristics on 
whether to attempt updating the xmin; it would be cheap enough to always 
try it.

- Heikki




Re: advance local xmin more aggressively

From
Robert Haas
Date:
On Wed, Dec 10, 2014 at 12:57 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Clever. Could we use that method in ResourceOwnerReleaseInternal and
> ResourceOwnerDelete, too? Might be best to have a
> ResourceOwnerWalk(resowner, callback) function for walking all resource
> owners in a tree, instead of one for walking the snapshots in them.

Sure.  It would be a little more complicated there since you want to
stop when you get back to the starting point, but not too bad.  But is
that solving any actual problem?

>> 2. Instead of traversing the tree until we find an xmin equal to the
>> one we're currently advertising, the code now traverses the entire
>> tree each time it runs. However, it also keeps a record of how many
>> times the oldest xmin occurred in the tree, which is decremented each
>> time we unregister a snapshot with that xmin; the traversal doesn't
>> run again until that count reaches 0.  That fixed the performance
>> regression on your test case.
>>
>> With a million subtransactions:
>>
>> master 34.464s 33.742s 34.317s
>> advance-xmin 34.516s 34.069s 34.196s
>
> Well, you can still get the slowness back by running other stuff in the
> background. I admit that it's a very obscure case, probably fine in
> practice. I would still feel better if snapmgr.c did its own bookkeeping,
> from an aesthetic point of view. In a heap, or even just a linked list, if
> the performance characteristics of that is acceptable.
>
> It occurs to me that the pairing heap I just posted in another thread
> (http://www.postgresql.org/message-id/54886BB8.9040000@vmware.com) would be
> a good fit for this. It's extremely cheap to insert to and to find the
> minimum item (O(1)), and the delete operation is O(log N), amortized. I
> didn't implement a delete operation, for removing a particular node, I only
> did delete-min, but it's basically the same code. Using a pairing heap for
> this might be overkill, but if we have that implementation anyway, the code
> in snapmgr.c to use it would be very simple, so I see little reason not to.
> It might even be simpler than your patch, because you wouldn't need to have
> the heuristics on whether to attempt updating the xmin; it would be cheap
> enough to always try it.

Care to code it up?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: advance local xmin more aggressively

From
Heikki Linnakangas
Date:
On 12/10/2014 08:35 PM, Robert Haas wrote:
> On Wed, Dec 10, 2014 at 12:57 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> Clever. Could we use that method in ResourceOwnerReleaseInternal and
>> ResourceOwnerDelete, too? Might be best to have a
>> ResourceOwnerWalk(resowner, callback) function for walking all resource
>> owners in a tree, instead of one for walking the snapshots in them.
>
> Sure.  It would be a little more complicated there since you want to
> stop when you get back to the starting point, but not too bad.  But is
> that solving any actual problem?

I thought that a transaction commit or abort in some special
circumstances might call ResourceOwnerReleaseInternal on the top level,
but I can't make it happen. The machinery in xact.c is too clever, and
always releases the resource owners from the bottom up. And I can't find
a way to create a deep resource owner tree in any other way. So I guess
it's fine as it is.

MemoryContextCheck and MemoryContextPrint also recurse, however.
MemoryContextCheck is only enabled with --enable-cassert, but
MemoryContextPrint is called when you run out of memory. That could turn
a plain "out of memory" error into a stack overrun, triggering a server
crash and restart.

>> It occurs to me that the pairing heap I just posted in another thread
>> (http://www.postgresql.org/message-id/54886BB8.9040000@vmware.com) would be
>> a good fit for this. It's extremely cheap to insert to and to find the
>> minimum item (O(1)), and the delete operation is O(log N), amortized. I
>> didn't implement a delete operation, for removing a particular node, I only
>> did delete-min, but it's basically the same code. Using a pairing heap for
>> this might be overkill, but if we have that implementation anyway, the code
>> in snapmgr.c to use it would be very simple, so I see little reason not to.
>> It might even be simpler than your patch, because you wouldn't need to have
>> the heuristics on whether to attempt updating the xmin; it would be cheap
>> enough to always try it.
>
> Care to code it up?

Here you are.

- Heikki

Attachment

Re: advance local xmin more aggressively

From
Robert Haas
Date:
On Wed, Dec 10, 2014 at 3:28 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
>> Care to code it up?
>
> Here you are.

That was quick.

You need to add a semicolon to the end of line 20 in pairingheap.c.

I wonder what the worst case for this is.  I tried it on your
destruction test case from before and it doesn't seem to cost anything
material:

your patch: 0m38.714s 0m34.424s 0m35.191s
master: 0m35.260s 0m34.643s 0m34.945s
my patch: 0m34.728s 0m34.619s 0m35.015s

The first measurement with your patch is somewhat of an outlier, but
that was also the first measurement I took, which might have thrown
off the results.  Basically, either your patch or my patch looks free
from here.  I'm kind of suspicious about that: it doesn't seem like
the additional linked-list manipulation you're doing in
RegisterSnapshotOnOwner ought to cost something, but maybe that
function just isn't called enough for it to matter, at least on this
test case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: advance local xmin more aggressively

From
Jeff Janes
Date:
On Wed, Dec 10, 2014 at 3:46 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Dec 10, 2014 at 3:28 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
>> Care to code it up?
>
> Here you are.

That was quick.

You need to add a semicolon to the end of line 20 in pairingheap.c.

In addition to the semicolon, it doesn't build under cassert.  There are some pairingheap_empty that need to be pairingheap_is_empty, and snapmgr.c needs an address of operator near line 355 and something is wrong in snapmgr.c near line 811.

Cheers,

Jeff

Re: advance local xmin more aggressively

From
Heikki Linnakangas
Date:
On 12/16/2014 10:41 PM, Jeff Janes wrote:
> On Wed, Dec 10, 2014 at 3:46 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Wed, Dec 10, 2014 at 3:28 PM, Heikki Linnakangas
>> <hlinnakangas@vmware.com> wrote:
>>>> Care to code it up?
>>>
>>> Here you are.
>>
>> That was quick.
>>
>> You need to add a semicolon to the end of line 20 in pairingheap.c.
>
> In addition to the semicolon, it doesn't build under cassert.  There are
> some pairingheap_empty that need to be pairingheap_is_empty, and snapmgr.c
> needs an address of operator near line 355 and something is wrong
> in snapmgr.c near line 811.

Here's an updated version, rebased over the pairing heap code that I
just committed, and fixing those bugs.

- Heikki


Attachment

Re: advance local xmin more aggressively

From
Michael Paquier
Date:
On Mon, Dec 22, 2014 at 7:31 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Here's an updated version, rebased over the pairing heap code that I just
> committed, and fixing those bugs.
So, are we reaching an outcome for the match happening here?
-- 
Michael



Re: advance local xmin more aggressively

From
Robert Haas
Date:
On Thu, Jan 15, 2015 at 3:08 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Mon, Dec 22, 2014 at 7:31 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> Here's an updated version, rebased over the pairing heap code that I just
>> committed, and fixing those bugs.
> So, are we reaching an outcome for the match happening here?

Well, I still like using the existing ResourceOwner pointers to find
the snapshots more than introducing a separate data structure just for
that.  But I'm OK with Heikki committing his version and, if anybody
notices the new code becoming a hotspot, we can revisit the issue.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: advance local xmin more aggressively

From
Heikki Linnakangas
Date:
On 01/15/2015 09:57 PM, Robert Haas wrote:
> On Thu, Jan 15, 2015 at 3:08 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Mon, Dec 22, 2014 at 7:31 PM, Heikki Linnakangas
>> <hlinnakangas@vmware.com> wrote:
>>> Here's an updated version, rebased over the pairing heap code that I just
>>> committed, and fixing those bugs.
>> So, are we reaching an outcome for the match happening here?
>
> Well, I still like using the existing ResourceOwner pointers to find
> the snapshots more than introducing a separate data structure just for
> that.  But I'm OK with Heikki committing his version and, if anybody
> notices the new code becoming a hotspot, we can revisit the issue.

Ok, committed.

- Heikki