Thread: Mac OS X: system shutdown prevents checkpoint

Mac OS X: system shutdown prevents checkpoint

From
Tom Lane
Date:
I've been looking into Francois Suter's recent reports of Postgres not
shutting down cleanly on Mac OS X 10.1.  I find that it's quite
reproducible.  If you tell the system to shut down in the normal
fashion (eg, pick "Shut Down" from the Apple menu), the postmaster
does not terminate, leading to WAL recovery upon restart --- or
even worse, failure to restart if the postmaster PID recorded in the
lockfile happens to get assigned to some other daemon.

Observe the normal trace of postmaster shutdown (running with -d4,
logging of timestamps and PIDs enabled):

2002-04-30 00:08:30 [315]    DEBUG:  pmdie 15
2002-04-30 00:08:30 [315]    DEBUG:  smart shutdown request
2002-04-30 00:08:30 [331]    DEBUG:  shutting down
2002-04-30 00:08:32 [331]    DEBUG:  database system is shut down
2002-04-30 00:08:32 [331]    DEBUG:  proc_exit(0)
2002-04-30 00:08:32 [331]    DEBUG:  shmem_exit(0)
2002-04-30 00:08:32 [331]    DEBUG:  exit(0)
2002-04-30 00:08:32 [315]    DEBUG:  reaping dead processes
2002-04-30 00:08:32 [315]    DEBUG:  proc_exit(0)
2002-04-30 00:08:32 [315]    DEBUG:  shmem_exit(0)
2002-04-30 00:08:32 [315]    DEBUG:  exit(0)

The postmaster (here PID 315) forks a subprocess to flush shared buffers
and checkpoint the WAL log.  When the subprocess exits, the postmaster
removes its lockfile and shuts down.  The subprocess takes a minimum of
2 seconds because there's a sleep(2) in the checkpoint fsync code.

Now here's what I see in the case of shutting down the OS X system:

2002-04-30 00:25:35 [376]    DEBUG:  pmdie 15
2002-04-30 00:25:35 [376]    DEBUG:  smart shutdown request

... and nothing more.  Actual system shutdown (power down) occurred at
approximately 00:26:06 by my watch, over thirty seconds later than the
postmaster received SIGTERM.  So there was plenty of time to do the
checkpoint subprocess.  (Indeed, I believe that thirty seconds is the
grace period Darwin's init process allows SIGTERM'd processes before
giving up and hard-killing them.  So the system was actually sitting and
waiting for the postmaster.)

What we appear to have here is that the kernel is not allowing the
postmaster to fork a checkpoint subprocess.  But there's no indication
that the postmaster got a fork() error return, either.  Seems like it's
just hung.

Does this ring a bell with anyone?  Is it an OSX bug, or a "feature";
and if the latter, how can we work around it?
        regards, tom lane


Re: Mac OS X: system shutdown prevents checkpoint

From
"Christopher Kings-Lynne"
Date:
I showed this to my friend who's a FreeBSD committer (Adrian Chadd) and he's
actually setting up a MacOS/X box at the moment and will look into it -
assuming you don't discover the problem first...

Chris

> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org]On Behalf Of Tom Lane
> Sent: Tuesday, 30 April 2002 1:26 PM
> To: pgsql-hackers@postgresql.org
> Cc: Francois Suter
> Subject: [HACKERS] Mac OS X: system shutdown prevents checkpoint
>
>
> I've been looking into Francois Suter's recent reports of Postgres not
> shutting down cleanly on Mac OS X 10.1.  I find that it's quite
> reproducible.  If you tell the system to shut down in the normal
> fashion (eg, pick "Shut Down" from the Apple menu), the postmaster
> does not terminate, leading to WAL recovery upon restart --- or
> even worse, failure to restart if the postmaster PID recorded in the
> lockfile happens to get assigned to some other daemon.
>
> Observe the normal trace of postmaster shutdown (running with -d4,
> logging of timestamps and PIDs enabled):
>
> 2002-04-30 00:08:30 [315]    DEBUG:  pmdie 15
> 2002-04-30 00:08:30 [315]    DEBUG:  smart shutdown request
> 2002-04-30 00:08:30 [331]    DEBUG:  shutting down
> 2002-04-30 00:08:32 [331]    DEBUG:  database system is shut down
> 2002-04-30 00:08:32 [331]    DEBUG:  proc_exit(0)
> 2002-04-30 00:08:32 [331]    DEBUG:  shmem_exit(0)
> 2002-04-30 00:08:32 [331]    DEBUG:  exit(0)
> 2002-04-30 00:08:32 [315]    DEBUG:  reaping dead processes
> 2002-04-30 00:08:32 [315]    DEBUG:  proc_exit(0)
> 2002-04-30 00:08:32 [315]    DEBUG:  shmem_exit(0)
> 2002-04-30 00:08:32 [315]    DEBUG:  exit(0)
>
> The postmaster (here PID 315) forks a subprocess to flush shared buffers
> and checkpoint the WAL log.  When the subprocess exits, the postmaster
> removes its lockfile and shuts down.  The subprocess takes a minimum of
> 2 seconds because there's a sleep(2) in the checkpoint fsync code.
>
> Now here's what I see in the case of shutting down the OS X system:
>
> 2002-04-30 00:25:35 [376]    DEBUG:  pmdie 15
> 2002-04-30 00:25:35 [376]    DEBUG:  smart shutdown request
>
> ... and nothing more.  Actual system shutdown (power down) occurred at
> approximately 00:26:06 by my watch, over thirty seconds later than the
> postmaster received SIGTERM.  So there was plenty of time to do the
> checkpoint subprocess.  (Indeed, I believe that thirty seconds is the
> grace period Darwin's init process allows SIGTERM'd processes before
> giving up and hard-killing them.  So the system was actually sitting and
> waiting for the postmaster.)
>
> What we appear to have here is that the kernel is not allowing the
> postmaster to fork a checkpoint subprocess.  But there's no indication
> that the postmaster got a fork() error return, either.  Seems like it's
> just hung.
>
> Does this ring a bell with anyone?  Is it an OSX bug, or a "feature";
> and if the latter, how can we work around it?
>
>             regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
>



Re: Mac OS X: system shutdown prevents checkpoint

From
Peter Bierman
Date:
At 1:26 AM -0400 4/30/02, Tom Lane wrote:
>I've been looking into Francois Suter's recent reports of Postgres not
>shutting down cleanly on Mac OS X 10.1.
>
>Now here's what I see in the case of shutting down the OS X system:
>
>2002-04-30 00:25:35 [376]    DEBUG:  pmdie 15
>2002-04-30 00:25:35 [376]    DEBUG:  smart shutdown request
>
>... and nothing more.  Actual system shutdown (power down) occurred at
>approximately 00:26:06 by my watch, over thirty seconds later than the
>postmaster received SIGTERM.  So there was plenty of time to do the
>checkpoint subprocess.  (Indeed, I believe that thirty seconds is the
>grace period Darwin's init process allows SIGTERM'd processes before
>giving up and hard-killing them.  So the system was actually sitting and
>waiting for the postmaster.)
>
>What we appear to have here is that the kernel is not allowing the
>postmaster to fork a checkpoint subprocess.  But there's no indication
>that the postmaster got a fork() error return, either.  Seems like it's
>just hung.


Unfortunately, I don't have time right now to look into this myself, and because I just moved, I don't have a machine I
cangive someone an account on to try it themselves (PacBell says 20 days for DSL xfer). But I asked around, and got a
pairof tips from the Mac OS X Core OS group. If you want to converse with either of the people named below, they're
bothactive on the darwin-development mailing list. (http://lists.apple.com/mailman/listinfo/darwin-development) 

-pmb


At 1:52 PM -0700 5/1/02, Jim Magee wrote:
>On Wednesday, May 1, 2002, at 01:34 PM, Peter Bierman wrote:
>
>> Is fork() disallowed after shutdown starts?
>
>No, it's allowed.  But, depending upon timing, the new process may be
>hammered with a SIGTERM right away (maybe even before main()).  It is
>always very tricky to fork() as the result of a daemon getting a signal.
>They are often process group leader, and so their children may get the
>same signal they just got.
>
>POSIX is very ambiguous on whether a new process in the group should also
>get the signal while we're still delivering them, or whether it shouldn't
>because it wasn't in the group at the time the signal was first
>delivered).  Both choices have their problems, and so developers have to
>deal with either case.  Do you have signals masked off correctly before
>the fork()/exec()?
>
>Is fork really returning a PID in the parent, and it just looks like the
>child didn't make it to returning from its fork() call?  There are some
>preparation things that happen in dyld and libc as part of returning fom
>fork in the child, and these run before we make it look like fork()
>returned in the child.  If they encounter an error (maybe because the
>services they need to talk to are no longer available), they have nothing
>else to do but call _exit() - making it look like the child never returned
>from fork().
>
>But in either the dydl/libc exit case, or the signal case, the parent
>should get a wait result indicating why the child went away so
>prematurely.  If is was an exit(), maybe using vfork() will yield better
>results, as there is no need for child-side setup in the vfork() case.
>
>--Jim


At 2:01 PM -0700 5/1/02, Matt Watson wrote:
>
>It could be that the child has blocked trying to contact a dead lookupd.






Re: Mac OS X: system shutdown prevents checkpoint

From
Tom Lane
Date:
Peter Bierman <bierman@apple.com> writes:
> Is fork() disallowed after shutdown starts?
>> 
>> No, it's allowed.  But, depending upon timing, the new process may be
>> hammered with a SIGTERM right away (maybe even before main()).

Good point.  The fork is executed with SIGTERM blocked --- but the
checkpoint child process currently will enable SIGTERM shortly after
being forked.  On reflection that seems like a bad idea; probably the
checkpoint process should ignore SIGTERM so that it won't get killed
prematurely during system shutdown.

However, that doesn't explain our OS X problem.  I added some debug
printouts, and can now report positively that (a) the fork() call
returns normally in the parent process, providing an apparently-correct
child PID value; but (b) the fork never returns in the child.  It
doesn't ever get as far as trying to enable SIGTERM.

>> Is fork really returning a PID in the parent, and it just looks like the
>> child didn't make it to returning from its fork() call?  There are some
>> preparation things that happen in dyld and libc as part of returning fom
>> fork in the child, and these run before we make it look like fork()
>> returned in the child.  If they encounter an error (maybe because the
>> services they need to talk to are no longer available), they have nothing
>> else to do but call _exit() - making it look like the child never returned
>> from fork().

Hmmm ... that seems very close to what I'm seeing.

>> But in either the dydl/libc exit case, or the signal case, the parent
>> should get a wait result indicating why the child went away so
>> prematurely.

The parent is not getting any wait() result indicating that its child died.
(If it were, we'd not have the problem being complained of.)

Is it possible that something in the child's fork() processing will wait
around for a response from a service that's already died?  Why is fork()
dependent on any outside service whatever --- isn't that a certain
recipe for system failures?
        regards, tom lane


Re: Mac OS X: system shutdown prevents checkpoint

From
sugita@sra.co.jp
Date:
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Thu, 02 May 2002 00:45:19 -0400

;;; However, that doesn't explain our OS X problem.  I added some debug
;;; printouts, and can now report positively that (a) the fork() call
;;; returns normally in the parent process, providing an apparently-correct
;;; child PID value; but (b) the fork never returns in the child.  It
;;; doesn't ever get as far as trying to enable SIGTERM.
&
;;; Is it possible that something in the child's fork() processing will wait
;;; around for a response from a service that's already died?  Why is fork()
;;; dependent on any outside service whatever --- isn't that a certain
;;; recipe for system failures?

I asked Apple this issue. This is a bug of Mac OS X. The problem is registered
to their bug database for the appropriate eingineers for investigation.


Kenji Sugita



Re: Mac OS X: system shutdown prevents checkpoint

From
sugita@sra.co.jp
Date:
;;; From: sugita@sra.co.jp                                                        
;;; Date: Tue, 16 Jul 2002 10:22:14 +0900 (JST)                                   
;;;                                                                                   
;;; ;;; However, that doesn't explain our OS X problem.  I added some debug           
;;; ;;; printouts, and can now report positively that (a) the fork() call             
;;; ;;; returns normally in the parent process, providing an apparently-correct       
;;; ;;; child PID value; but (b) the fork never returns in the child.  It             
;;; ;;; doesn't ever get as far as trying to enable SIGTERM.                          
;;; &                                                                                 
;;; ;;; Is it possible that something in the child's fork() processing will wait      
;;; ;;; around for a response from a service that's already died?  Why is fork()      
;;; ;;; dependent on any outside service whatever --- isn't that a certain            
;;; ;;; recipe for system failures?                                                   
;;;                                                                                   
;;; I asked Apple this issue. This is a bug of Mac OS X. The problem is registered    
;;; to their bug database for the appropriate eingineers for investigation.           

This problem is fixed for Mac OS X 10.2.


Kenji Sugita