Re: failures on machines using jfs - Mailing list pgsql-performance

From Christopher Browne
Subject Re: failures on machines using jfs
Msg-id 60k73zck3x.fsf@dev6.int.libertyrms.info
In response to failures on machines using jfs  (Andrew Sullivan <andrew@libertyrms.info>)
List pgsql-performance
Robert_Creager@LogicalChaos.org (Robert Creager) writes:
> When grilled further on (Wed, 7 Jan 2004 18:06:08 -0500),
> Andrew Sullivan <andrew@libertyrms.info> confessed:
>
>> We have lately had a couple of cases where machines either locked
>> up, slowed down to the point of complete unusability, or died
>> completely while using jfs.  We are _not_ sure that jfs is in fact
>> the culprit.  In one case, a kernel panic appeared to be referring
>> to the jfs kernel module, but I can't be sure as I lost the output
>> immediately thereafter.  Yesterday, we had a problem of data
>> corruption on a failed jfs volume.
>>
>> None of this is to say that jfs is in fact to blame, nor even that,
>> if it is, it does not have something to do with the age of our
>> installations, &c. (these are all RH 8).  In fact, I suspect
>> hardware in both cases.  But I thought I'd mention it just in case
>> other people are seeing strange behaviour, on the principle of
>> "better safe than sorry."
>
> Interestingly enough, I'm using JFS on a new scsi disk with Mandrake
> 9.1 and was having similar problems.  I was generating heavy disk
> usage through database and astronomical data reductions.  My machine
> (dual AMD) would suddenly hang.  No new jobs would run, just
> increase the load, until I rebooted the machine.
>
> I solved my problems by creating a 128MB ram disk (using EXT2) for
> the temp data produced by my reduction runs.
>
> I believe JFS was to blame, not hardware, but you never know...

Interesting.

The factors that were consistently present together when this
happened for us were:

 1.  Heavy DB updates taking place on JFS filesystems;

 2.  SMP (we suspected Xeon hyperthreading as a possible factor, but
     shut it off and still saw the same problem...)

 3.  The third factor, which appeared to be a catalyst, was copying a
     file > 2GB in size onto the system via scp.

The third piece was a particularly interesting aspect; the file would
get copied over successfully, and the scp process would hang (to the
point of "kill -9" being unable to touch it) immediately thereafter.

At that point, processes on the system that were accessing files on
the hung-up filesystem were locked, also unkillable by "kill -9."
That's certainly consistent with JFS being at the root of the problem,
whether it was the cause or not...
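
For anyone wanting to check whether they are seeing the same symptom,
here is a rough sketch in Python. It rests on an assumption that was
not verified on the affected machines: that the unkillable processes
were in uninterruptible sleep (state "D"), which is the usual reason a
Linux process ignores "kill -9". It reads the state field from
/proc/<pid>/stat; the function names are made up for illustration.

```python
import os


def parse_stat(stat_text):
    """Return (command, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so locate the first '(' and the last ')'.
    open_idx = stat_text.index("(")
    close_idx = stat_text.rindex(")")
    command = stat_text[open_idx + 1:close_idx]
    # The process state is the first field after the command name.
    state = stat_text[close_idx + 1:].split()[0]
    return command, state


def uninterruptible_processes(proc_root="/proc"):
    """List (pid, command) for processes whose state is 'D'."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                command, state = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-scan, or the file was unreadable
        if state == "D":
            stuck.append((int(entry), command))
    return sorted(stuck)


if __name__ == "__main__":
    for pid, command in uninterruptible_processes():
        print(pid, command)
```

A process that shows up here briefly is normal under heavy I/O; the
case of interest is one that stays in the list indefinitely, as the
scp process described above would if the assumption holds.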
--
let name="cbbrowne" and tld="libertyrms.info" in String.concat "@" [name;tld];;
<http://dev6.int.libertyrms.com/>
Christopher Browne
(416) 646 3304 x124 (land)
