Re: strange buildfarm failures - Mailing list pgsql-hackers

From Stefan Kaltenbrunner
Subject Re: strange buildfarm failures
Msg-id 46302E43.4020509@kaltenbrunner.cc
In response to Re: strange buildfarm failures  (Alvaro Herrera <alvherre@commandprompt.com>)
List pgsql-hackers
Alvaro Herrera wrote:
> Tom Lane wrote:
>> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>>> Stefan Kaltenbrunner wrote:
>>>> two of my buildfarm members had different but pretty weird looking
>>>> failures lately:
>>>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=quagga&dt=2007-04-25%2002:03:03
>>>> and
>>>>
>>>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=emu&dt=2007-04-24%2014:35:02
>>>>
>>>> any ideas on what might be causing those?
> 
> Just for the record, quagga and emu failures don't seem related to the
> report below.  They don't crash; the regression.diffs contains data that
> suggests that there may be data corruption of some sort.
> 
> INSERT INTO INET_TBL (c, i) VALUES ('192.168.1.2/30', '192.168.1.226');
> ERROR:  invalid cidr value: "%{"
> 
> This doesn't seem to make much sense.

Yeah, on further reflection the failures from emu and quagga seem
unrelated to the issue lionfish is experiencing.

> 
> 
>>> lionfish just failed too:
>>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2007-04-25%2005:30:09
>> And had a similar failure a few days ago.  The curious thing is that
>> what we get in the postmaster log is
>>
>> LOG:  server process (PID 23405) was terminated by signal 6: Aborted
>> LOG:  terminating any other active server processes
>>
>> You would think SIGABRT would come from an assertion failure, but
>> there's no preceding assertion message in the log.  The other
>> characteristic of these crashes is that *all* of the failing regression
>> instances report "terminating connection because of crash of another
>> server process", which suggests strongly that the crash was in an
>> autovacuum process (if it were bgwriter or stats collector the
>> postmaster would've said so).  So I think the recent autovac patches
>> are at fault.  I spent a bit of time trolling for a spot where the code
>> might abort() without having printed anything, but didn't find one.
> 
> Hmm.  I kept an eye on the buildfarm for a few days, but saw nothing
> that could be connected to autovacuum so I neglected it.
> 
> This is the other failure:
> 
> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2007-04-20%2005:30:14
> 
> It shows the same pattern.  I am baffled -- I don't understand how it
> can die without reporting the error.

I should have mentioned this initially, but I think the failure from
2007-04-20 is not related at all.
That failure was very likely caused by the kernel running completely out
of memory (lionfish is a very resource-starved box with only 48MB of RAM
and 128MB of swap at that time; do we have a recent patch that increases
memory usage quite a lot?).
I immediately added another 128MB of swap after that, and I don't think
the failure from yesterday is the same (at least there are no kernel
logs that indicate a similar issue).
> 
> Apparently it crashes rather frequently, so it shouldn't be too
> difficult to reproduce on manual runs.  If we could get it to run with a
> higher debug level, it might prove helpful to further pinpoint the
> problem.

A manual run of the buildfarm script takes ~4.5 hours on lionfish ;-)

> 
> The core file would be much better obviously (first and foremost to
> confirm that it's autovacuum that's crashing ... )

I will see what I can come up with ...
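
For reference, here is a minimal sketch (plain Python, not PostgreSQL code) of how a parent process sees a child that calls abort(): the child is killed by signal 6 whether or not it printed anything first, which matches the "terminated by signal 6: Aborted" line in the postmaster log quoted above. The helper name run_and_classify is made up for illustration, and the sketch assumes a POSIX system where SIGABRT is signal 6.

```python
import signal
import subprocess
import sys


def run_and_classify(code: str) -> str:
    """Run a Python snippet in a child process and describe how it ended.

    Hypothetical helper for illustration only. On POSIX, subprocess reports
    a child killed by signal N as returncode == -N.
    """
    proc = subprocess.run([sys.executable, "-c", code], capture_output=True)
    if proc.returncode < 0:
        sig = signal.Signals(-proc.returncode)
        return f"terminated by signal {sig.value}: {sig.name}"
    return f"exited with code {proc.returncode}"


if __name__ == "__main__":
    # os.abort() raises SIGABRT in the child without printing a message.
    print(run_and_classify("import os; os.abort()"))
    # An ordinary exit is reported through the exit code instead.
    print(run_and_classify("raise SystemExit(3)"))
```

This only shows that the signal alone does not say which code path called abort(); a core file from the real crash would still be needed for that.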


Stefan

