Re: BUG #9551: Hang in State "authentication" Prevents Vacuum from Freeing Dead Rows - Mailing list pgsql-bugs

From Tom Lane
Subject Re: BUG #9551: Hang in State "authentication" Prevents Vacuum from Freeing Dead Rows
Date
Msg-id 7087.1394738351@sss.pgh.pa.us
Whole thread Raw
In response to BUG #9551: Hang in State "authentication" Prevents Vacuum from Freeing Dead Rows  (jmorton@gmail.com)
Responses Re: BUG #9551: Hang in State "authentication" Prevents Vacuum from Freeing Dead Rows  ("James L. Morton" <jmorton@gmail.com>)
List pgsql-bugs
jmorton@gmail.com writes:
> (gdb) bt
> #0  0x00000030820e104e in __lll_lock_wait_private () from /lib64/libc.so.6
> #1  0x00000030820d1bf7 in _L_lock_646 () from /lib64/libc.so.6
> #2  0x00000030820d168a in __vsyslog_chk () from /lib64/libc.so.6
> #3  0x00000030820d1a73 in __syslog_chk () from /lib64/libc.so.6
> #4  0x00000000006dd6f1 in EmitErrorReport ()
> #5  0x00000000006ddaf5 in errfinish ()
> #6  0x0000000000623f6a in ProcessInterrupts ()
> #7  0x000000000062439c in StatementCancelHandler ()
> #8  <signal handler called>
> #9  0x00000030820306f7 in kill () from /lib64/libc.so.6
> #10 0x0000000000616d01 in ?? ()
> #11 0x0000000000616e3f in handle_sig_alarm ()
> #12 <signal handler called>
> #13 0x00000030820d5eb5 in send () from /lib64/libc.so.6
> #14 0x00000030820d1750 in __vsyslog_chk () from /lib64/libc.so.6
> #15 0x00000030820d1a73 in __syslog_chk () from /lib64/libc.so.6
> #16 0x00000000006dd6f1 in EmitErrorReport ()
> #17 0x00000000006ddaf5 in errfinish ()
> #18 0x0000000000583c57 in ?? ()
> #19 0x0000000000583c81 in ?? ()
> #20 0x00000000005858af in ClientAuthentication ()
> #21 0x00000000006e9bb6 in InitPostgres ()
> #22 0x000000000062609d in PostgresMain ()
> #23 0x00000000005e9dfc in ?? ()
> #24 0x00000000005eab81 in PostmasterMain ()
> #25 0x000000000058e38e in main ()

Interesting stack trace.  Evidently, syslog() on your platform has some
internal mutex.  What seems to have happened is that SIGALRM interrupted
an attempt to log a client-authentication-related message, and then the
error message that it attempted to print resulted in recursive entry to
syslog(), causing a hang since the outer invocation already had the mutex.

This pathway would be impossible if the error level were ERROR, because
errfinish() clears ImmediateInterruptOK in that case, which would prevent
the SIGALRM interrupt handler from doing anything much.  A likely bet
though is that the error level in the outer call was FATAL (ie,
authentication failure); or it might've been just a LOG message.

I'm thinking the most robust solution to this is to have errfinish()
do this at entry:

   bool save_ImmediateInterruptOK = ImmediateInterruptOK;
   ImmediateInterruptOK = false;

and then, *if* it decides to return to caller, do this just before
returning:

   ImmediateInterruptOK = save_ImmediateInterruptOK;
   if (save_ImmediateInterruptOK)
       CHECK_FOR_INTERRUPTS();

This should make the error reporting logic safe against interrupts even
if invoked while ImmediateInterruptOK is true; which probably shouldn't
happen during normal backend operation but is definitely possible during
client authentication.

BTW, are you using an unusually short value for authentication_timeout?
This case doesn't seem likely to arise with normal values, since the
timeout isn't expected to occur unless something's totally wedged.

            regards, tom lane

pgsql-bugs by date:

Previous
From: Fujii Masao
Date:
Subject: Re: BUG #9118: WAL Sender does not disconnect replication clients during shutdown
Next
From:
Date:
Subject: Very slow query in PostgreSQL 9.3.3