Thread: [BUGS] BUG #14821: idle_in_transaction_session_timeout sometimes getsignored when statement timeout is pending

The following bug has been logged on the website:

Bug reference:      14821
Logged by:          Lukas Fittl
Email address:      lukas@fittl.com
PostgreSQL version: 9.6.4
Operating system:   Amazon Linux RHEL 6
Description:

Hi all,

I've been observing a problem on 9.6.4, but in my understanding this still
applies on master too.

Relevant config:

idle_in_transaction_session_timeout = 30s
statement_timeout = 30s

Backend thats idle in transaction and should have been cancelled:

=> SELECT pid, state, now() - backend_start AS backend_age, now() -
xact_start AS xact_age, now() - state_change AS state_age, query FROM
pg_stat_activity WHERE state = 'idle in transaction' ORDER BY
backend_start;pid  |        state        |   backend_age   |    xact_age     |
state_age    |      query       
------+---------------------+-----------------+-----------------+-----------------+------------------4005 | idle in
transaction| 04:29:51.844063 | 01:19:10.984972 | 
01:19:10.984947 | BEGIN READ WRITE
(1 row)

Backtrace from that backend:

(gdb) bt
#0  0x00007f9dffb298c3 in __epoll_wait_nocancel () at
../sysdeps/unix/syscall-template.S:81
#1  0x00000000006a06d6 in WaitEventSetWaitBlock (nevents=1,
occurred_events=0x7fff0e0f3cf0, cur_timeout=-1, set=0x29ea1a8) at
latch.c:1053
#2  WaitEventSetWait (set=0x29ea1a8, timeout=timeout@entry=-1,
occurred_events=occurred_events@entry=0x7fff0e0f3cf0,
nevents=nevents@entry=1) at latch.c:1007
#3  0x00000000005e4985 in secure_read (port=0x2a42f30, ptr=0xc09260
<PqRecvBuffer>, len=8192) at be-secure.c:149
#4  0x00000000005ed898 in pq_recvbuf () at pqcomm.c:940
#5  0x00000000005ee3a5 in pq_getbyte () at pqcomm.c:983
#6  0x00000000006bf012 in SocketBackend (inBuf=0x7fff0e0f3e80) at
postgres.c:326
#7  ReadCommand (inBuf=0x7fff0e0f3e80) at postgres.c:499
#8  PostgresMain (argc=<optimized out>, argv=argv@entry=0x2a4c4f8,
dbname=0x2a4c468 "mydb", username=<optimized out>) at postgres.c:4019
#9  0x000000000046ed2b in BackendRun (port=0x2a42f30) at postmaster.c:4294
#10 BackendStartup (port=0x2a42f30) at postmaster.c:3968
#11 ServerLoop () at postmaster.c:1719
#12 0x0000000000666946 in PostmasterMain (argc=argc@entry=3,
argv=argv@entry=0x29e86b0) at postmaster.c:1327
#13 0x000000000046f8a2 in main (argc=3, argv=0x29e86b0) at main.c:228

So as you can see its waiting on line 326 of tcop/postgres.c for data from
pq_getbyte(). And the reason that ProcessInterrupts() can't move forward is
due to the HOLD_CANCEL_INTERRUPTS() just before that in combination with a
pending query cancellation.

Relevant variables:

(gdb) p InterruptHoldoffCount
$13 = 0
(gdb) p CritSectionCount
$14 = 0
(gdb) p QueryCancelHoldoffCount
$16 = 1
(gdb) p QueryCancelPending
$17 = 1 '\001'
(gdb) p IdleInTransactionSessionTimeout
$1 = 30000
(gdb) p IdleInTransactionSessionTimeoutPending
$2 = 1 '\001'
(gdb) p all_timeouts[IDLE_IN_TRANSACTION_SESSION_TIMEOUT]
$3 = {index = IDLE_IN_TRANSACTION_SESSION_TIMEOUT, indicator = 1 '\001',
timeout_handler = 0x7baee0 <IdleInTransactionSessionTimeoutHandler>,
start_time = 559266384031137, fin_time = 559266414031137}

It seems to me that ProcessInterrupts() should check for the idle in
transaction timeout first, since it short-circuits and returns early if
QueryCancelPending && QueryCancelHoldoffCount != 0 before even getting
there.

Best,
Lukas


--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Hi,

On 2017-09-21 01:09:56 +0000, lukas@fittl.com wrote:
> It seems to me that ProcessInterrupts() should check for the idle in
> transaction timeout first, since it short-circuits and returns early if
> QueryCancelPending && QueryCancelHoldoffCount != 0 before even getting
> there.

Indeed - although I wonder if the correct fix isn't to move things
around, but to instead avoid the order dependency changing the short
circuit logic so that there's no early return. Like e.g.
if (QueryCancelPending && QueryCancelHoldoffCount != 0){               /* rearm */       }       else if
(QueryCancelPending)      {               /* handle interrupt */       }
 

there's really no good reason for the return right now, and it's bound
to create more bugs in the future.

Greetings,

Andres Freund


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Hi,

On Wed, Sep 20, 2017 at 6:29 PM, Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2017-09-21 01:09:56 +0000, lukas@fittl.com wrote:
> It seems to me that ProcessInterrupts() should check for the idle in
> transaction timeout first, since it short-circuits and returns early if
> QueryCancelPending && QueryCancelHoldoffCount != 0 before even getting
> there.

Indeed - although I wonder if the correct fix isn't to move things
around, but to instead avoid the order dependency changing the short
circuit logic so that there's no early return. Like e.g.

        if (QueryCancelPending && QueryCancelHoldoffCount != 0)
        {
                /* rearm */
        }
        else if (QueryCancelPending)
        {
                /* handle interrupt */
        }

there's really no good reason for the return right now, and it's bound
to create more bugs in the future.

That seems like a reasonable approach and a good idea.

I'll take a stab at making a patch for this tonight and send it over to hackers then.

Best,
Lukas

--
Lukas Fittl

Skype: lfittl
Phone: +1 415 321 0630