Re: SIGUSR1 pingpong between master na autovacum launcher causes crash - Mailing list pgsql-hackers

From Zdenek Kotala
Subject Re: SIGUSR1 pingpong between master na autovacum launcher causes crash
Date
Msg-id 1251114447.3252.16.camel@localhost
In response to Re: SIGUSR1 pingpong between master na autovacum launcher causes crash  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: SIGUSR1 pingpong between master na autovacum launcher causes crash  (Zdenek Kotala <Zdenek.Kotala@Sun.COM>)
Re: SIGUSR1 pingpong between master na autovacum launcher causes crash  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Tom Lane wrote on Sat, 22 Aug 2009 at 09:56 -0400:
> Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
> > Here are the most important records from yesterday's issues.
> > Messages:
> > ---------
> > Aug 20 11:14:54 genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 507 (postgres)
> 
> Hmm, that seems to confirm the idea that something had run the machine
> out of memory/swap space, which would explain the repeated ENOMEM fork
> failures.  But we're still no closer to understanding how come the
> delay in the avlauncher didn't do what it was supposed to.

I found the hungry process that ate up all the memory, and fortunately
it is not postgres :-).

I also ran the following dtrace script:

dtrace -n 'syscall::kill:entry / execname=="postgres"/ { printf("%i  %s, %i->%i : %i", timestamp, execname, pid, arg0, arg1); }'

and it shows the following (slightly modified) output:

<snip>
CPU      Timestamp[ns]     diff[ms]    caller          callee  sig
0    2750745000052090    899,96        28604    ->    28608    16
3    2750745100280460    100,23        28608    ->    28604    16
1    2750746000144690    899,86        28604    ->    28608    16
3    2750746100380940    100,24        28608    ->    28604    16
2    2750747000135380    899,75        28604    ->    28608    16
3    2750747100171650    100,04        28608    ->    28604    16
0    2750748000101050    899,93        28604    ->    28608    16
3    2750748100331900    100,23        28608    ->    28604    16
1    2750749000148550    899,82        28604    ->    28608    16
3    2750749100386640    100,24        28608    ->    28604    16
2    2750750000095040    899,71        28604    ->    28608    16
3    2750750100127780    100,03        28608    ->    28604    16
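
The diff[ms] column is simply the gap between consecutive timestamps,
converted from nanoseconds to milliseconds. A minimal sketch of that
arithmetic in C follows; the helper name gaps_ms is made up for this
illustration and is not part of dtrace or PostgreSQL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: for n timestamps in nanoseconds, store the n-1
 * gaps between neighbouring entries into out[], in milliseconds.
 */
static void
gaps_ms(const uint64_t *ts_ns, size_t n, double *out)
{
    assert(n >= 1);
    for (size_t i = 0; i + 1 < n; i++)
        out[i] = (double) (ts_ns[i + 1] - ts_ns[i]) / 1e6;
}
```

Fed the first three timestamps from the table, this gives gaps of about
100.23 ms and 899.86 ms, matching the second and third rows.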

We can see there that the AVlauncher really does wait 100 ms, but that is
not enough when the system is under stress.

I tested Alvaro's patch and it works, because it no longer leads to stack
consumption, but it exposes another bug in the StartAutovacuumWorker() code.
When fork fails, the bn structure is freed, but
ReleasePostmasterChildSlot() should be called as well. See the error:

2009-08-24 11:50:20.360 CEST 3468 FATAL:  no free slots in PMChildFlags array

and comment in source code:

/* Out of slots ... should never happen, else postmaster.c messed up */
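
To illustrate how such a leak exhausts the array, here is a toy model in
C. It is not the postmaster code: the slot array, its size, and every
toy_* name are invented for this sketch, and fork failure is simulated
with a flag. It only shows that a slot assigned before a failed fork must
be released again, otherwise repeated failures use up all slots.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NSLOTS 4            /* invented size, just for the demo */

static bool toy_slot_used[TOY_NSLOTS];

/* Mark every slot free again. */
static void
toy_reset(void)
{
    for (int i = 0; i < TOY_NSLOTS; i++)
        toy_slot_used[i] = false;
}

/* Claim the first free slot; return -1 if none is left. */
static int
toy_assign_slot(void)
{
    for (int i = 0; i < TOY_NSLOTS; i++)
    {
        if (!toy_slot_used[i])
        {
            toy_slot_used[i] = true;
            return i;
        }
    }
    return -1;
}

/* Give a previously assigned slot back. */
static void
toy_release_slot(int slot)
{
    assert(slot >= 0 && slot < TOY_NSLOTS);
    toy_slot_used[slot] = false;
}

/*
 * Try to start a worker.  fork_ok simulates whether fork() succeeds;
 * release_on_failure selects the fixed behaviour (true) or the leaky
 * one (false).  Returns the slot now in use, -1 if the simulated fork
 * failed, or -2 if no slot was available.
 */
static int
toy_start_worker(bool fork_ok, bool release_on_failure)
{
    int slot = toy_assign_slot();

    if (slot < 0)
        return -2;              /* analogous to "no free slots" */
    if (fork_ok)
        return slot;
    if (release_on_failure)
        toy_release_slot(slot); /* the missing cleanup step */
    return -1;
}
```

With release_on_failure set to false, TOY_NSLOTS failed attempts leave
every slot taken and the next call returns -2 even if fork would now
succeed. With it set to true, any number of failures leaves the array
empty.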

I think that Alvaro's patch is good and that it fixes the crash problem. I
also think that the AVlauncher could wait a little longer than 100 ms. When
the system cannot fork, I don't see any reason to hurry to repeat the fork
operation. Maybe 1 s is a good compromise.
Zdenek






