Thread: Interesting tight loop
In production today (8.1.4), I ran into a backend process that wouldn't
cancel right away -- minutes went by. It was in

[0] TransactionIdIsCurrentTransactionId
[1] HeapTupleSatisfiesSnapshot
...

But the interesting thing is that there were 4.6 million elements in the
s->childXids list. Which is why it took so damn long. I can't quite figure
out how I induced this state. It is an OLAP server with about 10-20
connections that run "long" queries (from 5 seconds to 24 hours).

If this is a common possible state, it seems that perhaps a hash of the
childXids would be more appropriate. Does the order of the childXids
chained off the current transaction state matter? Most of the places I
could find that reference it seem to just attempt to find an Xid in
there. O(1) sounds a lot better than O(n) :-D

Best regards,

Theo

// Theo Schlossnagle
// CTO -- http://www.omniti.com/~jesus/
// OmniTI Computer Consulting, Inc. -- http://www.omniti.com/
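Theo's O(n)-vs-O(1) point can be sketched outside the server. The types and function names below (`XidCell`, `child_xid_member_list`, `child_xid_member_sorted`) are invented stand-ins for illustration, not PostgreSQL's actual structures; the second function shows one cheaper membership test, a binary search over a sorted array, as an alternative to the hash Theo suggests:

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int TransactionId;

/* Simplified stand-in for a linked childXids list -- illustrative
 * names only, not the real 8.1 TransactionState layout. */
typedef struct XidCell
{
    TransactionId   xid;
    struct XidCell *next;
} XidCell;

/* O(n) walk: one pointer chase and one comparison per child xid,
 * as in the loop Theo was stuck in. */
static int
child_xid_member_list(const XidCell *head, TransactionId xid)
{
    const XidCell *c;

    for (c = head; c != NULL; c = c->next)
        if (c->xid == xid)
            return 1;
    return 0;
}

/* O(log n) alternative: binary search over a sorted array of child
 * xids (subxact ids within one backend are assigned in increasing
 * order, so appends keep such an array sorted in the common case). */
static int
child_xid_member_sorted(const TransactionId *xids, size_t n,
                        TransactionId xid)
{
    size_t lo = 0, hi = n;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (xids[mid] == xid)
            return 1;
        if (xids[mid] < xid)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0;
}
```

Either variant answers the only question the callers Theo found actually ask: "is this xid among my children?" -- the list order itself carries no meaning for that test.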
Theo Schlossnagle <jesus@omniti.com> writes:

> But the interesting thing is that there were 4.6 million elements in the
> s->childXids list. Which is why it took so damn long. I can't quite figure
> out how I induced this state. It is an OLAP server with about 10-20
> connections that run "long" queries (from 5 seconds to 24 hours).

Wow, I think that means you've had 4.6 million subtransactions within the
current transaction. Is this a PL/PgSQL function or a script that uses
SAVEPOINT a lot?

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
On Sep 14, 2006, at 7:03 AM, Gregory Stark wrote:

> Theo Schlossnagle <jesus@omniti.com> writes:
>
>> But the interesting thing is that there were 4.6 million elements in the
>> s->childXids list. Which is why it took so damn long. I can't quite figure
>> out how I induced this state. It is an OLAP server with about 10-20
>> connections that run "long" queries (from 5 seconds to 24 hours).
>
> Wow, I think that means you've had 4.6 million subtransactions within the
> current transaction. Is this a PL/PgSQL function or a script that uses
> SAVEPOINT a lot?

We don't use savepoints too much. Maybe one or two across our 1k or so
pl/pgsql procs.

We use dbi-link which is plperl. Perhaps that is somehow creating
subtransactions?

// Theo Schlossnagle
// CTO -- http://www.omniti.com/~jesus/
// OmniTI Computer Consulting, Inc. -- http://www.omniti.com/
Theo Schlossnagle <jesus@omniti.com> writes:

> We don't use savepoints too much. Maybe one or two across our 1k or so
> pl/pgsql procs.

Well if they're in a loop...

> We use dbi-link which is plperl. Perhaps that is somehow creating
> subtransactions?

Ok, I more or less see what's going on. plperl creates a subtransaction
whenever you execute an SPI query from inside a perl function. That's so
that errors in the query can throw perl exceptions and be caught in the
perl code.

So if your DBI source is an SPI connection (and not a connection to some
other database source) you will get a subtransaction for every
remote_select() call.

In addition, dbi-link seems to do its work by creating a trigger which
fires once for every record you modify in its "shadow table". I'm not
sure what you're doing with those records, but if you're sending them on
via an SPI connection to another table you'll get a subtransaction every
time the trigger fires.

It would be interesting to know which of these it is, because in the
former case it may be something that could be fixed. We only really need
to remember subtransactions that have hit disk. But I rather suspect
it's the latter case, since it's easy to see you firing a trigger 4.6M
times.

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
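The mechanism Gregory describes can be modelled in a few lines. This is a toy with invented names (`ToyXact`, `toy_spi_call`), not PostgreSQL code: the point is only that if each SPI call is wrapped in its own subtransaction, and each committed subtransaction that was assigned an xid is remembered on the parent's childXids list, then the list grows by one cell per call -- so millions of calls mean millions of list cells.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef unsigned int TransactionId;

/* Invented stand-ins, not PostgreSQL's structures. */
typedef struct XidCell
{
    TransactionId   xid;
    struct XidCell *next;
} XidCell;

typedef struct ToyXact
{
    TransactionId next_xid;     /* next xid to assign */
    XidCell      *child_xids;   /* head of the childXids list */
    size_t        n_children;   /* list length */
} ToyXact;

/* One remote_select()-style SPI call: a subtransaction is opened
 * around the query; on commit, its xid is folded into the parent's
 * childXids list. */
static void
toy_spi_call(ToyXact *t)
{
    XidCell *c = malloc(sizeof(XidCell));

    c->xid = t->next_xid++;     /* subtransaction assigns an xid */
    c->next = t->child_xids;    /* parent remembers the child */
    t->child_xids = c;
    t->n_children++;
}
```

Under this model, Theo's 4.6M-element list simply means 4.6M wrapped SPI calls (or trigger firings) within one top-level transaction.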
Gregory Stark <stark@enterprisedb.com> writes:

> Ok, I more or less see what's going on. plperl creates a subtransaction
> whenever you execute an SPI query from inside a perl function. That's so
> that errors in the query can throw perl exceptions and be caught in the
> perl code.

It might also be worthwhile modifying plperl to check $^S, which
indicates whether exceptions are going to be caught anywhere. If they're
not going to be caught, would it be possible to skip creating the nested
transaction and allow errors to force the whole transaction to be rolled
back?

I'm not clear, though, whether there would be any internal perl
interpreter state that would get messed up if we just longjmp out from
within perl code rather than allow perl exceptions to get us out.

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
On Sep 14, 2006, at 8:19 AM, Gregory Stark wrote:

> Ok, I more or less see what's going on. plperl creates a subtransaction
> whenever you execute an SPI query from inside a perl function. That's so
> that errors in the query can throw perl exceptions and be caught in the
> perl code.
>
> So if your DBI source is an SPI connection (and not a connection to some
> other database source) you will get a subtransaction for every
> remote_select() call.
>
> In addition, dbi-link seems to do its work by creating a trigger which
> fires once for every record you modify in its "shadow table". I'm not
> sure what you're doing with those records, but if you're sending them on
> via an SPI connection to another table you'll get a subtransaction every
> time the trigger fires.
>
> It would be interesting to know which of these it is, because in the
> former case it may be something that could be fixed. We only really need
> to remember subtransactions that have hit disk. But I rather suspect
> it's the latter case, since it's easy to see you firing a trigger 4.6M
> times.

My remote_select() in DBI does a RETURN NEXT $row; You think that might
be the problem? If that's the case -- that needs to be fixed.

The metalevel of the remote_select is:

remote_select(query) {
    handle = remote.prepare(query)
    handle.execute;
    while (row = handle.fetchrow_hashref) {
        return_next $row;
    }
    handle.close;
    return;
}

If that return_next is causing a subtransaction, that would explain my
world of pain well.

// Theo Schlossnagle
// CTO -- http://www.omniti.com/~jesus/
// OmniTI Computer Consulting, Inc. -- http://www.omniti.com/
Theo Schlossnagle <jesus@omniti.com> writes:

> In production today (8.1.4), I ran into a backend process that
> wouldn't cancel right away -- minutes went by.
> It was in
> [0] TransactionIdIsCurrentTransactionId
> [1] HeapTupleSatisfiesSnapshot
> ...
> But the interesting thing is that there were 4.6 million elements in
> the s->childXids list. Which is why it took so damn long.

Well, I don't think TransactionIdIsCurrentTransactionId() itself is to
blame --- the loop over childXids is about as tight as such a loop could
possibly be (two indirect fetches and a comparison per iteration). Even
with 4.6M child Xids it could hardly take a second on any modern
machine. I'm not inclined to add a CHECK_FOR_INTERRUPTS there. The
problem should instead be blamed on something further down the call
stack ... did you save the full stack by any chance?

			regards, tom lane
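Tom's arithmetic is worth spelling out as a back-of-the-envelope (the function name `total_comparisons` is invented for illustration): one walk over even 4.6M child xids is cheap, but the caller visible in Theo's stack, HeapTupleSatisfiesSnapshot, can invoke the check once per tuple it examines, so the total work multiplies -- which is why minutes can go by even though each individual call is sub-second.

```c
#include <assert.h>
#include <stdint.h>

/* Worst-case comparison count when a per-tuple visibility check
 * scans an n_child_xids-element list for each of tuples_checked
 * tuples without finding a match: the cost is multiplicative. */
static uint64_t
total_comparisons(uint64_t tuples_checked, uint64_t n_child_xids)
{
    return tuples_checked * n_child_xids;
}
```

At 4.6M child xids, a scan touching even 100k tuples implies on the order of 4.6e11 comparisons -- enough to turn a "tight" per-call loop into minutes of uncancellable work.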