Thread: Idle In Transaction
Hi all,

I'm bordering on insanity trying to track down an "<IDLE> in transaction" problem. It started a few weeks ago. We're running a Java application on Spring 2.0 and Hibernate 3.2 (with the L2 cache), using the PostgreSQL JDBC 8.3-604 driver, with pgBouncer in front of the database (I've tried pgPool II as well and gotten the same result).

Every so often (usually in the early morning), we see an "<IDLE> in transaction" backend show up. It appears to lock / block other statements from going through, though I'm not sure why. If left unchecked, we end up with all of our connections being overrun.

For the last several days, I've woken up rather early and run:

    SELECT datname, usename, procpid, client_addr, waiting,
           query_start, current_query
    FROM pg_stat_activity
    ORDER BY query_start ASC;

This shows me something like (wide output wrapped for readability):

     datname  |  usename   | procpid | client_addr  | waiting |          query_start          | current_query
    ----------+------------+---------+--------------+---------+-------------------------------+-----------------------
    scheduler | ressystems |    6632 | 192.168.1.10 | t       | 2010-07-09 05:56:51.309911-05 | update employee set
        min_hours_per_day=$1, max_hours_per_day=$2, min_hours=$3, max_hours=$4, min_pay=$5, max_pay=$6,
        max_shifts=$7, min_shifts=$8, dateCreated=$9, user_id=$10, overtime=$11, notice_timeframe=$12,
        hire_date=$13, termination_date=$14, active_attendance_block=$15, internal_number=$16,
        internal_password=$17, max_shifts_per_week=$18, min_shifts_per_week=$19, birth_date=$20, ssn=$21,
        ssn_suffix=$22, payroll_id=$23, home_location_id=$24, home_department_id=$25, parent_id=$26,
        deleted=$27, startTimestamp=$28, endTimestamp=$29 where id=$30
    scheduler | ressystems |   11076 | 192.168.1.10 | f       | 2010-07-09 06:43:52.852328-05 | <IDLE> in transaction

In short, there are two backends. The "update" has been running since 5:56am, and is waiting on the other backend to finish.
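A simpler query can flag these directly; this is a sketch against pg_stat_activity as it exists in 8.3/8.4 (xact_start records when the open transaction began, and the 5-minute threshold is just an illustrative choice):

    SELECT procpid, usename, client_addr,
           now() - xact_start AS xact_age
    FROM pg_stat_activity
    WHERE current_query = '<IDLE> in transaction'
      AND xact_start < now() - interval '5 minutes'
    ORDER BY xact_start;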
When looking at the locks table, I run:

    SELECT bl.pid          AS blocked_pid,
           a.usename       AS blocked_user,
           kl.pid          AS blocking_pid,
           ka.usename      AS blocking_user,
           a.current_query AS blocked_statement
    FROM pg_catalog.pg_locks bl
    JOIN pg_catalog.pg_stat_activity a  ON bl.pid = a.procpid
    JOIN pg_catalog.pg_locks kl
    JOIN pg_catalog.pg_stat_activity ka ON kl.pid = ka.procpid
      ON bl.transactionid = kl.transactionid AND bl.pid != kl.pid
    WHERE NOT bl.granted;

which gives:

     blocked_pid | blocked_user | blocking_pid | blocking_user | blocked_statement
    -------------+--------------+--------------+---------------+--------------------------------------------
            6632 | ressystems   |        11076 | ressystems    | update employee set min_hours_per_day=$1,
                                                                 ... where id=$30 (the same UPDATE as above)
    (1 row)

This tells me that 6632 (the UPDATE) is blocked because of 11076, the "<IDLE> in transaction" backend, which I gather has been sitting there, idle, for almost an hour?

Would a connection pooler cause (or amplify) any issues relating to this? How can I track down the issue here? I've been looking through web app and database logs without much luck. Is there any way to have PostgreSQL log when a transaction sits IDLE for too long, and perhaps what its last few statements were? (I've done this manually, though we are only logging statements taking longer than 5 ms, and there doesn't appear to be any consistency.)

Thanks!

--
Anthony
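To see what the idle backend is actually holding (and what the UPDATE is stuck on), a sketch against pg_locks on 8.x, using the two PIDs from the output above:

    SELECT l.pid, l.locktype, l.mode, l.granted,
           c.relname, l.transactionid
    FROM pg_catalog.pg_locks l
    LEFT JOIN pg_catalog.pg_class c ON c.oid = l.relation
    WHERE l.pid IN (6632, 11076)
    ORDER BY l.pid, l.granted DESC;

An ungranted row for 6632 with locktype "transaction" would mean the UPDATE is queued behind 11076's uncommitted transaction, the classic sign of a row-level lock conflict.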
Anthony Presley <anthony@resolution.com> writes:
> Every so often (usually in the early morning), we are seeing an "<IDLE>
> in transaction" show up. This appears to lock / block other statements
> from going through, though I'm not sure why. If left unchecked, we end
> up with all of our connections being overrun.

Well, the idle transaction is evidently sitting on some lock that the
UPDATE needs. You didn't show the pg_locks columns that would tell
exactly what lock it is, though ...

> Would a connection pooler cause (or amplify) any issues relating to
> this?

It shouldn't. Any decent pooler will ensure that no transaction remains
open when it transfers the connection to another client.

> How can I track down the issue here? I've been looking through
> web app and database logs without much luck.

Can you track the session connection (the port number) back to a client
process? If there's a pooler in the way, you'll probably need to crank
up its logging level to be able to make that association. Once you've
got that, you could attach to the client with a debugger and see what it
thinks it's doing.

The other line of attack I can think of is to turn on log_connections
and log_statement and make sure log_line_prefix includes the PID.
Then you can find the series of statements that were issued before
the idle transaction went to sleep, and that hopefully is enough
information to track down the client code.

			regards, tom lane
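A minimal postgresql.conf sketch of that logging setup (parameter names as they exist in 8.3/8.4; %p expands to the backend PID, %r to the client host and port; log_statement = 'all' is verbose, so treat it as a temporary hunting aid):

    log_connections    = on
    log_disconnections = on
    log_statement      = 'all'
    log_line_prefix    = '%m %p %u@%d %r '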
On Tue, Jul 13, 2010 at 02:53:25PM -0500, Anthony Presley wrote:
> I'm bordering on insanity, trying to track down an IDLE in transaction
> problem.

you might find this helpful:
http://www.depesz.com/index.php/2008/08/28/hunting-idle-in-transactions/

Best regards,

Hubert Lubaczewski

--
Linkedin: http://www.linkedin.com/in/depesz / blog: http://www.depesz.com/
jid/gtalk: depesz@depesz.com / aim:depeszhdl / skype:depesz_hdl / gg:6749007
Hubert,

:-)  Your script was one of the first that I found, thanks to the power of Google. My issue with your script is that, for one reason or another, when piping the logs through rsyslog, we end up with log lines that your perl code can't match up, like:

    Jul 12 04:02:10 artemis postgres[27803]: [53-1] 2010-07-12 04:02:10.980 CDT food@scheduler 27803 192.168.1.10(50304)LOG: statement: BEGIN;SET statement_timeout=30000;COMMIT;SELECT last_value, slots, used, ROUND(used/slots*100) AS percent, CASE WHEN slots < used THEN 0 ELSE slots - used END AS numleft FROM (SELECT last_value, CEIL((LEAST(max_value, 2147483647)-min_value::numeric +1)/increment_by::NUMERIC) AS slots, CEIL((last_value-min_value::numeric +1)/increment_by::NUMERIC) AS used FROM public.minor_preference_id_seq) foo
    Jul 12 04:02:11 artemis postgres[28022]: [24-1] 2010-07-12 04:02:11.167 CDT food@scheduler 28022 192.168.1.10(47322)LOG: duration: 273.333 ms
    Jul 12 04:02:11 artemis postgres[27487]: [13097-1] 2010-07-12 04:02:11.337 CDT food@scheduler 27487 192.168.1.10(39055)LOG: duration: 376.298 ms
    Jul 12 04:02:11 artemis postgres[27803]: [54-1] 2010-07-12 04:02:11.456 CDT food@scheduler 27803 192.168.1.10(50304)LOG: duration: 476.522 ms

I.e., the duration ends up on a different line, and basically none of the statements ever match in your perl script. Any guess here?

--
Anthony
On Wed, Jul 14, 2010 at 08:48:20AM -0500, Anthony Presley wrote:
> I.e., the duration ends up on a different line, and basically none of
> the statements ever match in your perl script. Any guess here?

You can modify the script to match the format, but I have a simpler solution - don't use syslog, at least for the duration of the hunt.

Best regards,

depesz

--
Linkedin: http://www.linkedin.com/in/depesz / blog: http://www.depesz.com/
jid/gtalk: depesz@depesz.com / aim:depeszhdl / skype:depesz_hdl / gg:6749007
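A sketch of sending the logs straight to files instead, with parameter names as they exist in 8.4 (on 8.3 the collector switch is called redirect_stderr rather than logging_collector):

    log_destination   = 'stderr'
    logging_collector = on
    log_directory     = 'pg_log'
    log_filename      = 'postgresql-%Y-%m-%d_%H%M%S.log'

With the collector on, each statement line and its duration line carry the same PID via log_line_prefix, so a script can pair them back up.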
On Tue, 2010-07-13 at 17:38 -0400, Tom Lane wrote:
> Anthony Presley <anthony@resolution.com> writes:
> > Every so often (usually in the early morning), we are seeing an "<IDLE>
> > in transaction" show up. This appears to lock / block other statements
> > from going through, though I'm not sure why. If left unchecked, we end
> > up with all of our connections being overrun.
>
> Well, the idle transaction is evidently sitting on some lock that the
> UPDATE needs. You didn't show the pg_locks columns that would tell
> exactly what lock it is, though ...
>
> Can you track the session connection (the port number) back to a client
> process? If there's a pooler in the way, you'll probably need to crank
> up its logging level to be able to make that association. Once you've
> got that, you could attach to the client with a debugger and see what it
> thinks it's doing.
>
> The other line of attack I can think of is to turn on log_connections
> and log_statement and make sure log_line_prefix includes the PID.
> Then you can find the series of statements that were issued before
> the idle transaction went to sleep, and that hopefully is enough
> information to track down the client code.
>
> 			regards, tom lane

Ok, I've written a script to find some of this information when an <IDLE> in transaction has been hanging out too long, and then run some SQL commands. Since there's a lot there, I've added it to PasteBin:

http://pastebin.com/TpfKd9Ya

In looking at it, it would appear that process 30637 has a RowExclusiveLock on the forecast_timeblock table ... though the query that is hung and waiting is on the employee table. However, 8982 is blocked, waiting on 30637. 8982 is an insert into the employee table.

forecast_timeblock has a foreign key to the employee table, i.e.:

    Foreign-key constraints:
        "fk80bcf09c965efae7" FOREIGN KEY (employee_id) REFERENCES employee(id)

I manually canceled the backend (8982), and tried running it again a few minutes later. And this was in the query log:

    06:41:21 artemis postgres[8982]: [35-1] LOG:  process 8982 still waiting for ShareLock on transaction 5153723 after 1000.797 ms

Any idea what gives? I don't understand locks well enough to see what's going on, but why would inserting into forecast_timeblock block an insert into the employee table (which has no relation back to forecast_timeblock)?

Thanks!

By the way, we're running:

    > select version();
                                                       version
    ------------------------------------------------------------------------------------------------------------------
     PostgreSQL 8.4.2 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20071124 (Red Hat 4.1.2-42), 64-bit

--
Anthony
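Given a "still waiting for ShareLock on transaction" message like that one, the holder of the transaction lock can be looked up directly; a sketch against pg_locks on 8.4, with 5153723 taken from the log line:

    SELECT pid, mode, granted
    FROM pg_catalog.pg_locks
    WHERE locktype = 'transaction'
      AND transactionid = '5153723'::xid;

The row with granted = true points at the backend everyone else is queued behind.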
Anthony Presley <anthony@resolution.com> writes:
> Ok, I've written a script to find some of this information when an
> <IDLE> in transaction has been hanging out too long, and then run some
> SQL commands. Since there's a lot there, I've added it to PasteBin:
> http://pastebin.com/TpfKd9Ya

It would help if you'd shown the actual queries you're using, because these outputs seem to be just a subset of what's going on --- in particular, your pg_locks output doesn't include *any* ungranted locks, so it's clearly not telling us what we need to know.

The only thing I can see here that looks suspicious is that process 30637 (the idle one) has a RowShareLock on the employee table, which implies that it has done a SELECT FOR UPDATE or SELECT FOR SHARE on that table. What is most likely happening is that it has got a row-level lock as a result of that on some row that the other process's UPDATE is trying to change. Row-level blocking shows up only rather indirectly in the pg_locks table --- typically as a block on a transaction ID --- so I think the reason we can't see anything there is that you're omitting the entries that would tell us about it.

> forecast_timeblock has a foreign key to the employee table, i.e.:
> Foreign-key constraints:
>     "fk80bcf09c965efae7" FOREIGN KEY (employee_id) REFERENCES employee(id)

> Any idea what gives? I don't understand locks well enough to see what's
> going on, but why would inserting into forecast_timeblock block an
> insert into the employee table (which has no relation back to
> forecast_timeblock)?

Well, an insert into forecast_timeblock would result in taking a SELECT FOR SHARE lock on the referenced employee row. (This is needed to make sure nobody else deletes the referenced row before the insert commits. Without it, there'd be a race condition that would allow the FK constraint's integrity to be violated.) Is it likely that 8982 is trying to update the same employee row that 30637 previously inserted a reference to? If so, that's the cause of the blockage.

The real bottom line here, of course, is that sitting around with an uncommitted transaction is bad news for concurrency. You need to fix the application-side logic to not do that.

			regards, tom lane
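To make the mechanism concrete, a hypothetical two-session sketch (table and column names are borrowed from the thread; any other columns forecast_timeblock needs are omitted for illustration):

    -- Session A: insert a child row and leave the transaction open
    BEGIN;
    INSERT INTO forecast_timeblock (employee_id) VALUES (42);
    -- the FK check has now taken a SELECT FOR SHARE lock
    -- on the employee row with id = 42
    -- ... application goes idle without COMMIT ...

    -- Session B: blocks here until session A commits or rolls back
    UPDATE employee SET max_hours = 50 WHERE id = 42;

Session B then shows up in the logs as waiting for a ShareLock on session A's transaction ID, exactly the pattern in the log line quoted earlier.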