backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks) - Mailing list pgsql-hackers

From MauMau
Subject backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
Msg-id 20DAEA8949EC4E2289C6E8E58560DEC0@maumau
In response to Back-branch update releases coming in a couple weeks  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
List pgsql-hackers
From: "Tom Lane" <tgl@sss.pgh.pa.us>
> Since we've fixed a couple of relatively nasty bugs recently, the core
> committee has determined that it'd be a good idea to push out PG update
> releases soon.  The current plan is to wrap on Monday Feb 4 for public
> announcement Thursday Feb 7.  If you're aware of any bug fixes you think
> ought to get included, now's the time to get them done ...

I've just encountered another serious bug, which I would like to see fixed
in the upcoming minor release.

I'm using streaming replication with PostgreSQL 9.1.6 on Linux (RHEL6.2,
kernel 2.6.32).  However, the problem should occur whether or not streaming
replication is in use.

When I ran "pg_ctl stop -mi" against the primary, some applications
connected to the primary did not stop.  The cause was that the backends were
deadlocked in quickdie(), with a call stack roughly like the following.  I'm
sorry, I left the stack trace file on the testing machine, so I'll post the
precise stack trace tomorrow.

some lock function
malloc()
gettext()
errhint()
quickdie()
<signal handler called because of SIGQUIT>
free()
...
PostgresMain()
...

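The root cause is that gettext() is called from the signal handler
quickdie() via errhint().  In the trace above, SIGQUIT arrived while free()
was running, so the malloc() called under gettext() presumably waits on a
lock that the interrupted free() still holds.  As you know, malloc() must
not be called in a signal handler: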

http://www.gnu.org/software/libc/manual/html_node/Nonreentrancy.html#Nonreentrancy

[Excerpt]
On most systems, malloc and free are not reentrant, because they use a 
static data structure which records what memory blocks are free. As a 
result, no library functions that allocate or free memory are reentrant. 
This includes functions that allocate space to store a result.


And gettext() calls malloc(), as reported below:

http://lists.gnu.org/archive/html/bug-coreutils/2005-04/msg00056.html

I think the solution is the typical one: have quickdie() just record the
receipt of SIGQUIT in a global variable and call siglongjmp(), and then
perform the tasks currently done in quickdie() when sigsetjmp() returns in
PostgresMain().
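
To make the idea concrete, here is a minimal standalone sketch of that
pattern.  It is not PostgreSQL code: got_sigquit, shutdown_jmp,
quickdie_sketch() and demo_immediate_shutdown() are hypothetical names I made
up for illustration, and raise() stands in for the postmaster sending SIGQUIT.

```c
#define _POSIX_C_SOURCE 200809L

#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical names for illustration; not PostgreSQL's actual variables. */
static volatile sig_atomic_t got_sigquit = 0;
static sigjmp_buf shutdown_jmp;

/*
 * The handler only records the signal and jumps.  It calls no malloc(),
 * gettext(), or stdio function.
 */
static void
quickdie_sketch(int signo)
{
    (void) signo;
    got_sigquit = 1;
    siglongjmp(shutdown_jmp, 1);
}

/* Returns 1 if the deferred shutdown path ran after SIGQUIT, -1 on error. */
int
demo_immediate_shutdown(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = quickdie_sketch;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGQUIT, &sa, NULL) != 0)
        return -1;

    /* savemask = 1, so the signal mask is restored after the jump. */
    if (sigsetjmp(shutdown_jmp, 1) != 0)
    {
        /*
         * Back in normal (non-handler) context: the work quickdie() does
         * today, such as reporting the message, would go here.
         */
        if (got_sigquit)
        {
            fprintf(stderr,
                    "terminating connection because of immediate shutdown\n");
            return 1;
        }
        return -1;
    }

    raise(SIGQUIT);             /* stand-in for the postmaster's SIGQUIT */
    return 0;                   /* not reached: the handler jumps away */
}
```

Calling demo_immediate_shutdown() from main() exercises the path.  Note that
the sketch raises the signal at a point where no allocator call is in
progress; what happens when the jump abandons an interrupted malloc() or
free() is a separate question that the real patch would have to consider.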

What do you think about this solution?  Could you include the fix?  If it's
okay and you want, I'll submit a patch.

Regards
MauMau



