Thread: Idle In Transaction
Hi all,

I'm bordering on insanity trying to track down an "<IDLE> in transaction" problem. It started a few weeks ago. We're running a Java application on Spring 2.0 and Hibernate 3.2 (with the L2 cache), using the PostgreSQL JDBC 8.3-604 driver, with pgBouncer in front of the database (I've tried pgPool II as well and gotten the same result).

Every so often (usually in the early morning), we see an "<IDLE> in transaction" backend show up. It appears to lock / block other statements from going through, though I'm not sure why. If left unchecked, we end up with all of our connections being overrun.

For the last several days, I've woken up rather early and run:

    SELECT datname, usename, procpid, client_addr, waiting,
           query_start, current_query
    FROM pg_stat_activity
    ORDER BY query_start ASC;

This shows me something like (wide output wrapped for readability):

     datname  |  usename   | procpid | client_addr  | waiting |          query_start          | current_query
    ----------+------------+---------+--------------+---------+-------------------------------+-----------------------
    scheduler | ressystems |    6632 | 192.168.1.10 | t       | 2010-07-09 05:56:51.309911-05 | update employee set
        min_hours_per_day=$1, max_hours_per_day=$2, min_hours=$3, max_hours=$4, min_pay=$5, max_pay=$6,
        max_shifts=$7, min_shifts=$8, dateCreated=$9, user_id=$10, overtime=$11, notice_timeframe=$12,
        hire_date=$13, termination_date=$14, active_attendance_block=$15, internal_number=$16,
        internal_password=$17, max_shifts_per_week=$18, min_shifts_per_week=$19, birth_date=$20, ssn=$21,
        ssn_suffix=$22, payroll_id=$23, home_location_id=$24, home_department_id=$25, parent_id=$26,
        deleted=$27, startTimestamp=$28, endTimestamp=$29 where id=$30
    scheduler | ressystems |   11076 | 192.168.1.10 | f       | 2010-07-09 06:43:52.852328-05 | <IDLE> in transaction

In short, there are two backends. The "update" has been running since 5:56am, and is waiting on the other backend to finish.
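A simpler query can flag these directly; this is a sketch against pg_stat_activity as it exists in 8.3/8.4 (xact_start records when the open transaction began, and the 5-minute threshold is just an illustrative choice):

    SELECT procpid, usename, client_addr,
           now() - xact_start AS xact_age
    FROM pg_stat_activity
    WHERE current_query = '<IDLE> in transaction'
      AND xact_start < now() - interval '5 minutes'
    ORDER BY xact_start;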
When looking at the locks table, I run:

    SELECT bl.pid          AS blocked_pid,
           a.usename       AS blocked_user,
           kl.pid          AS blocking_pid,
           ka.usename      AS blocking_user,
           a.current_query AS blocked_statement
    FROM pg_catalog.pg_locks bl
    JOIN pg_catalog.pg_stat_activity a  ON bl.pid = a.procpid
    JOIN pg_catalog.pg_locks kl
    JOIN pg_catalog.pg_stat_activity ka ON kl.pid = ka.procpid
      ON bl.transactionid = kl.transactionid AND bl.pid != kl.pid
    WHERE NOT bl.granted;

which gives:

     blocked_pid | blocked_user | blocking_pid | blocking_user | blocked_statement
    -------------+--------------+--------------+---------------+--------------------------------------------
            6632 | ressystems   |        11076 | ressystems    | update employee set min_hours_per_day=$1,
                                                                 ... where id=$30 (the same UPDATE as above)
    (1 row)

This tells me that 6632 (the UPDATE) is blocked because of 11076, the "<IDLE> in transaction" backend, which I gather has been sitting there, idle, for almost an hour?

Would a connection pooler cause (or amplify) any issues relating to this? How can I track down the issue here? I've been looking through web app and database logs without much luck. Is there any way to have PostgreSQL log when a transaction sits IDLE for too long, and perhaps what its last few statements were? (I've done this manually, though we are only logging statements taking longer than 5 ms, and there doesn't appear to be any consistency.)

Thanks!

--
Anthony
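To see what the idle backend is actually holding (and what the UPDATE is stuck on), a sketch against pg_locks on 8.x, using the two PIDs from the output above:

    SELECT l.pid, l.locktype, l.mode, l.granted,
           c.relname, l.transactionid
    FROM pg_catalog.pg_locks l
    LEFT JOIN pg_catalog.pg_class c ON c.oid = l.relation
    WHERE l.pid IN (6632, 11076)
    ORDER BY l.pid, l.granted DESC;

An ungranted row for 6632 with locktype "transaction" would mean the UPDATE is queued behind 11076's uncommitted transaction, the classic sign of a row-level lock conflict.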
Anthony Presley <anthony@resolution.com> writes:
> Every so often (usually in the early morning), we are seeing an "<IDLE>
> in transaction" show up. This appears to lock / block other statements
> from going through, though I'm not sure why. If left unchecked, we end
> up with all of our connections being overrun.

Well, the idle transaction is evidently sitting on some lock that the
UPDATE needs. You didn't show the pg_locks columns that would tell
exactly what lock it is, though ...

> Would a connection pooler cause (or amplify) any issues relating to
> this?

It shouldn't. Any decent pooler will ensure that no transaction remains
open when it transfers the connection to another client.

> How can I track down the issue here? I've been looking through
> web app and database logs without much luck.

Can you track the session connection (the port number) back to a client
process? If there's a pooler in the way, you'll probably need to crank
up its logging level to be able to make that association. Once you've
got that, you could attach to the client with a debugger and see what it
thinks it's doing.

The other line of attack I can think of is to turn on log_connections
and log_statement and make sure log_line_prefix includes the PID.
Then you can find the series of statements that were issued before
the idle transaction went to sleep, and that hopefully is enough
information to track down the client code.

			regards, tom lane
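A minimal postgresql.conf sketch of that logging setup (parameter names as they exist in 8.3/8.4; %p expands to the backend PID, %r to the client host and port; log_statement = 'all' is verbose, so treat it as a temporary hunting aid):

    log_connections    = on
    log_disconnections = on
    log_statement      = 'all'
    log_line_prefix    = '%m %p %u@%d %r '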
On Tue, Jul 13, 2010 at 02:53:25PM -0500, Anthony Presley wrote:
> I'm bordering on insanity, trying to track down an IDLE in transaction
> problem.

you might find this helpful:
http://www.depesz.com/index.php/2008/08/28/hunting-idle-in-transactions/

Best regards,

Hubert Lubaczewski

--
Linkedin: http://www.linkedin.com/in/depesz / blog: http://www.depesz.com/
jid/gtalk: depesz@depesz.com / aim:depeszhdl / skype:depesz_hdl / gg:6749007
Hubert,

:-)  Your script was one of the first that I found, thanks to the power of Google. My issue with your script is that, for one reason or another, when piping the logs through rsyslog, we end up with log lines that your perl code can't match up, like:

    Jul 12 04:02:10 artemis postgres[27803]: [53-1] 2010-07-12 04:02:10.980 CDT food@scheduler 27803 192.168.1.10(50304)LOG: statement: BEGIN;SET statement_timeout=30000;COMMIT;SELECT last_value, slots, used, ROUND(used/slots*100) AS percent, CASE WHEN slots < used THEN 0 ELSE slots - used END AS numleft FROM (SELECT last_value, CEIL((LEAST(max_value, 2147483647)-min_value::numeric +1)/increment_by::NUMERIC) AS slots, CEIL((last_value-min_value::numeric +1)/increment_by::NUMERIC) AS used FROM public.minor_preference_id_seq) foo
    Jul 12 04:02:11 artemis postgres[28022]: [24-1] 2010-07-12 04:02:11.167 CDT food@scheduler 28022 192.168.1.10(47322)LOG: duration: 273.333 ms
    Jul 12 04:02:11 artemis postgres[27487]: [13097-1] 2010-07-12 04:02:11.337 CDT food@scheduler 27487 192.168.1.10(39055)LOG: duration: 376.298 ms
    Jul 12 04:02:11 artemis postgres[27803]: [54-1] 2010-07-12 04:02:11.456 CDT food@scheduler 27803 192.168.1.10(50304)LOG: duration: 476.522 ms

I.e., the duration ends up on a different line, and basically none of the statements ever match in your perl script. Any guess here?

--
Anthony
On Wed, Jul 14, 2010 at 08:48:20AM -0500, Anthony Presley wrote:
> I.e., the duration ends up on a different line, and basically none of
> the statements ever match in your perl script. Any guess here?

You can modify the script to match the format, but I have a simpler solution - don't use syslog, at least for the duration of the hunt.

Best regards,

depesz

--
Linkedin: http://www.linkedin.com/in/depesz / blog: http://www.depesz.com/
jid/gtalk: depesz@depesz.com / aim:depeszhdl / skype:depesz_hdl / gg:6749007
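A sketch of sending the logs straight to files instead, with parameter names as they exist in 8.4 (on 8.3 the collector switch is called redirect_stderr rather than logging_collector):

    log_destination   = 'stderr'
    logging_collector = on
    log_directory     = 'pg_log'
    log_filename      = 'postgresql-%Y-%m-%d_%H%M%S.log'

With the collector on, each statement line and its duration line carry the same PID via log_line_prefix, so a script can pair them back up.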
On Tue, 2010-07-13 at 17:38 -0400, Tom Lane wrote:
> Anthony Presley <anthony@resolution.com> writes:
> > Every so often (usually in the early morning), we are seeing an "<IDLE>
> > in transaction" show up. This appears to lock / block other statements
> > from going through, though I'm not sure why. If left unchecked, we end
> > up with all of our connections being overrun.
>
> Well, the idle transaction is evidently sitting on some lock that the
> UPDATE needs. You didn't show the pg_locks columns that would tell
> exactly what lock it is, though ...
>
> Can you track the session connection (the port number) back to a client
> process? If there's a pooler in the way, you'll probably need to crank
> up its logging level to be able to make that association. Once you've
> got that, you could attach to the client with a debugger and see what it
> thinks it's doing.
>
> The other line of attack I can think of is to turn on log_connections
> and log_statement and make sure log_line_prefix includes the PID.
> Then you can find the series of statements that were issued before
> the idle transaction went to sleep, and that hopefully is enough
> information to track down the client code.
>
> 			regards, tom lane

Ok, I've written a script to find some of this information when an <IDLE> in transaction has been hanging out too long, and then run some SQL commands. Since there's a lot there, I've added it to PasteBin:

http://pastebin.com/TpfKd9Ya

In looking at it, it would appear that process 30637 has a RowExclusiveLock on the forecast_timeblock table ... though the query that is hung and waiting is on the employee table. However, 8982 is blocked, waiting on 30637. 8982 is an insert into the employee table.

forecast_timeblock has a foreign key to the employee table, i.e.:

    Foreign-key constraints:
        "fk80bcf09c965efae7" FOREIGN KEY (employee_id) REFERENCES employee(id)

I manually canceled the backend (8982), and tried running it again a few minutes later. And this was in the query log:

    06:41:21 artemis postgres[8982]: [35-1] LOG:  process 8982 still waiting for ShareLock on transaction 5153723 after 1000.797 ms

Any idea what gives? I don't understand locks well enough to see what's going on, but why would inserting into forecast_timeblock block an insert into the employee table (which has no relation back to forecast_timeblock)?

Thanks!

By the way, we're running:

    > select version();
                                                       version
    ------------------------------------------------------------------------------------------------------------------
     PostgreSQL 8.4.2 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20071124 (Red Hat 4.1.2-42), 64-bit

--
Anthony
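Given a "still waiting for ShareLock on transaction" message like that one, the holder of the transaction lock can be looked up directly; a sketch against pg_locks on 8.4, with 5153723 taken from the log line:

    SELECT pid, mode, granted
    FROM pg_catalog.pg_locks
    WHERE locktype = 'transaction'
      AND transactionid = '5153723'::xid;

The row with granted = true points at the backend everyone else is queued behind.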
Anthony Presley <anthony@resolution.com> writes:
> Ok, I've written a script to find some of this information when an
> <IDLE> in transaction has been hanging out too long, and then run some
> SQL commands. Since there's a lot there, I've added it to PasteBin:
> http://pastebin.com/TpfKd9Ya

It would help if you'd shown the actual queries you're using, because these outputs seem to be just a subset of what's going on --- in particular, your pg_locks output doesn't include *any* ungranted locks, so it's clearly not telling us what we need to know.

The only thing I can see here that looks suspicious is that process 30637 (the idle one) has a RowShareLock on the employee table, which implies that it has done a SELECT FOR UPDATE or SELECT FOR SHARE on that table. What is most likely happening is that it has got a row-level lock as a result of that on some row that the other process's UPDATE is trying to change. Row-level blocking shows up only rather indirectly in the pg_locks table --- typically as a block on a transaction ID --- so I think the reason we can't see anything there is that you're omitting the entries that would tell us about it.

> forecast_timeblock has a foreign key to the employee table, i.e.:
> Foreign-key constraints:
>     "fk80bcf09c965efae7" FOREIGN KEY (employee_id) REFERENCES employee(id)

> Any idea what gives? I don't understand locks well enough to see what's
> going on, but why would inserting into forecast_timeblock block an
> insert into the employee table (which has no relation back to
> forecast_timeblock)?

Well, an insert into forecast_timeblock would result in taking a SELECT FOR SHARE lock on the referenced employee row. (This is needed to make sure nobody else deletes the referenced row before the insert commits. Without it, there'd be a race condition that would allow the FK constraint's integrity to be violated.) Is it likely that 8982 is trying to update the same employee row that 30637 previously inserted a reference to? If so, that's the cause of the blockage.

The real bottom line here, of course, is that sitting around with an uncommitted transaction is bad news for concurrency. You need to fix the application-side logic to not do that.

			regards, tom lane
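To make the mechanism concrete, a hypothetical two-session sketch (table and column names are borrowed from the thread; any other columns forecast_timeblock needs are omitted for illustration):

    -- Session A: insert a child row and leave the transaction open
    BEGIN;
    INSERT INTO forecast_timeblock (employee_id) VALUES (42);
    -- the FK check has now taken a SELECT FOR SHARE lock
    -- on the employee row with id = 42
    -- ... application goes idle without COMMIT ...

    -- Session B: blocks here until session A commits or rolls back
    UPDATE employee SET max_hours = 50 WHERE id = 42;

Session B then shows up in the logs as waiting for a ShareLock on session A's transaction ID, exactly the pattern in the log line quoted earlier.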