Thread: Interesting tight loop


From: Theo Schlossnagle
Date:
In production today (8.1.4), I ran into a backend process that  
wouldn't cancel right away -- minutes went by.

It was in

[0] TransactionIdIsCurrentTransactionId
[1] HeapTupleSatisfiesSnapshot
...

But the interesting thing is that there were 4.6 million elements in
the s->childXids list.  Which is why it took so damn long.  I can't
quite figure out how I induced this state.  It is an OLAP server with
about 10-20 connections that run "long" queries (from 5 seconds to 24
hours).

If this is a common possible state, it seems that perhaps a hash of
the childXids would be more appropriate.  Does the order of the
childXids chained off the current transaction state matter?  Most of
the places I could find that reference it seem to just attempt to
find an Xid in there.  O(1) sounds a lot better than O(n)  :-D

Best regards,

Theo

// Theo Schlossnagle
// CTO -- http://www.omniti.com/~jesus/
// OmniTI Computer Consulting, Inc. -- http://www.omniti.com/




Re: Interesting tight loop

From: Gregory Stark
Date:
Theo Schlossnagle <jesus@omniti.com> writes:

> But the interesting thing is that there were 4.6 million elements in the
> s->childXids list.  Which is why it took so damn long.  I can't  quite figure
> out how I induced this state.  It is an OLAP server with  about 10-20
> connection that run "long" queries (from 5 seconds to 24  hours).

Wow, I think that means you've had 4.6 million subtransactions within the
current transaction. Is this a PL/PgSQL function or a script that uses
SAVEPOINT a lot?

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: Interesting tight loop

From: Theo Schlossnagle
Date:
On Sep 14, 2006, at 7:03 AM, Gregory Stark wrote:

> Theo Schlossnagle <jesus@omniti.com> writes:
>
>> But the interesting thing is that there were 4.6 million elements  
>> in the
>> s->childXids list.  Which is why it took so damn long.  I can't   
>> quite figure
>> out how I induced this state.  It is an OLAP server with  about 10-20
>> connection that run "long" queries (from 5 seconds to 24  hours).
>
> Wow, I think that means you've had 4.6 million subtransactions  
> within the
> current transaction. Is this a PL/PgSQL function or a script that uses
> SAVEPOINT a lot?

We don't use savepoints too much.  Maybe one or two across our 1k or
so pl/pgsql procs.

We use dbi-link, which is plperl.  Perhaps that is somehow creating
subtransactions?

// Theo Schlossnagle
// CTO -- http://www.omniti.com/~jesus/
// OmniTI Computer Consulting, Inc. -- http://www.omniti.com/




Re: Interesting tight loop

From: Gregory Stark
Date:
Theo Schlossnagle <jesus@omniti.com> writes:

> We don't use savepoint's too much.  Maybe one or two across out 1k or so
> pl/pgsql procs.

Well if they're in a loop...

> We use dbi-link which is plperl.  Perhaps that is somehow creating
> subtransactions?

Ok, I more or less see what's going on. plperl creates a subtransaction
whenever you execute an SPI query from inside a perl function. That's so that
errors in the query can throw perl exceptions and be caught in the perl code.

So if your DBI source is an SPI connection (and not a connection to some other
database source) you will get a subtransaction for every remote_select() call.

In addition, dbi-link seems to do its work by creating a trigger which fires
once for every record you modify in its "shadow table". I'm not sure what
you're doing with those records, but if you're sending them on via an SPI
connection to another table you'll get a subtransaction every time the trigger
fires.

It would be interesting to know which of these it is, because in the former
case it may be something that could be fixed. We only really need to remember
subtransactions that have hit disk. But I rather suspect it's the latter case,
since it's easy to see you firing a trigger 4.6M times.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: Interesting tight loop

From: Gregory Stark
Date:
Gregory Stark <stark@enterprisedb.com> writes:

> Ok, I more or less see what's going on. plperl creates a subtransaction
> whenever you execute an SPI query from inside a perl function. That's so that
> errors in the query can throw perl exceptions and be caught in the perl code.

It might also be worthwhile modifying plperl to check $^S which indicates
whether exceptions are going to be caught anywhere. If they're not going to be
caught would it be possible to skip creating the nested transaction and allow
them to force the whole transaction to be rolled back?

I'm not clear, though, whether there would be any internal perl interpreter
state that would get messed up if we just longjmp out from within perl code
rather than allow perl exceptions to get us out.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: Interesting tight loop

From: Theo Schlossnagle
Date:
On Sep 14, 2006, at 8:19 AM, Gregory Stark wrote:

> Theo Schlossnagle <jesus@omniti.com> writes:
>
>> We don't use savepoint's too much.  Maybe one or two across out 1k  
>> or so
>> pl/pgsql procs.
>
> Well if they're in a loop...
>
>> We use dbi-link which is plperl.  Perhaps that is somehow creating
>> subtransactions?
>
> Ok, I more or less see what's going on. plperl creates a  
> subtransaction
> whenever you execute an SPI query from inside a perl function.  
> That's so that
> errors in the query can throw perl exceptions and be caught in the  
> perl code.
>
> So if your DBI source is an SPI connection (and not a connection to  
> some other
> database source) you will get a subtransaction for every  
> remote_select() call.
>
> In addition, dbi-link seems to do its work by creating a trigger  
> which fires
> once for every record you modify in its "shadow table". I'm not  
> sure what
> you're doing with those records but if your sending them on via an SPI
> connection to another table you'll get a subtransaction every time  
> the trigger
> fires.
>
> It would be interesting to know which of these it is because in the  
> former
> case it may be something that could be fixed. We only really need  
> to remember
> subtransactions that have hit disk. But I rather suspect it's the  
> latter case
> since it's easy to see you firing a trigger 4.3M times.

My remote_select() in DBI does a RETURN NEXT $row;  You think that  
might be the problem?  If that's the case -- that needs to be fixed.   
The metalevel of the remote_select is:

remote_select(query) {
  handle = remote.prepare(query);
  handle.execute;
  while (row = handle.fetchrow_hashref) {
    return_next $row;
  }
  handle.close;
  return;
}

If that return_next is causing a subtransaction, that would explain
my world of pain well.

// Theo Schlossnagle
// CTO -- http://www.omniti.com/~jesus/
// OmniTI Computer Consulting, Inc. -- http://www.omniti.com/




Re: Interesting tight loop

From: Tom Lane
Date:
Theo Schlossnagle <jesus@omniti.com> writes:
> In production today (8.1.4), I ran into a backend process that  
> wouldn't cancel right away -- minutes went by.

> It was in

> [0] TransactionIdIsCurrentTransactionId
> [1] HeapTupleSatisfiesSnapshot
> ...

> But the interesting thing is that there were 4.6 million elements in  
> the s->childXids list.  Which is why it took so damn long.

Well, I don't think TransactionIdIsCurrentTransactionId() itself is to
blame --- the loop over childXids is about as tight as such a loop could
possibly be (two indirect fetches and a comparison per iteration).
Even with 4.6M child Xids it could hardly take a second on any modern
machine.  I'm not inclined to add a CHECK_FOR_INTERRUPTS there.  The
problem should instead be blamed on something further down the call
stack ... did you save the full stack by any chance?
        regards, tom lane