Re: failures on machines using jfs - Mailing list pgsql-performance

From Spiegelberg, Greg
Subject Re: failures on machines using jfs
Date
Msg-id 387C22290D3FD71195D300508BF7DB5238AE92@colmail01.cranel.com
In response to failures on machines using jfs  (Andrew Sullivan <andrew@libertyrms.info>)
Responses Re: failures on machines using jfs
Re: failures on machines using jfs
List pgsql-performance
It would seem we're experiencing something similar with our scratch
volume (JFS mounted with noatime).  It is still much faster than our
experiments with ext2, ext3, and reiserfs, but occasionally during
large loads it will hiccup for a couple of seconds.  No crashes yet.

I'm reluctant to switch back to any other file system because the
data import used to take a little over 1.5 hours but now takes just
under 20 minutes, and we haven't crashed yet.

For future reference:

 RedHat 7.3 w/2.4.18-18.7smp
 PostgreSQL 7.3.3 from source
 jfsutils 1.0.17-1
 Dual PIII Intel 1.4GHz & 2GB ECC
 Internal disk: 2xU160 SCSI, mirrored, location of our JFS file system
 External disk: Qlogic 2310 attached to FC-SW @2Gbps with ext3 on those LUNs

Greg


-----Original Message-----
From: Christopher Browne
To: pgsql-performance@postgresql.org
Sent: 1/10/04 9:08 PM
Subject: Re: [PERFORM] failures on machines using jfs

Robert_Creager@LogicalChaos.org (Robert Creager) writes:
> When grilled further on (Wed, 7 Jan 2004 18:06:08 -0500),
> Andrew Sullivan <andrew@libertyrms.info> confessed:
>
>> We have lately had a couple of cases where machines either locked
>> up, slowed down to the point of complete unusability, or died
>> completely while using jfs.  We are _not_ sure that jfs is in fact
>> the culprit.  In one case, a kernel panic appeared to be referring
>> to the jfs kernel module, but I can't be sure as I lost the output
>> immediately thereafter.  Yesterday, we had a problem of data
>> corruption on a failed jfs volume.
>>
>> None of this is to say that jfs is in fact to blame, nor even that,
>> if it is, it does not have something to do with the age of our
>> installations, &c. (these are all RH 8).  In fact, I suspect
>> hardware in both cases.  But I thought I'd mention it just in case
>> other people are seeing strange behaviour, on the principle of
>> "better safe than sorry."
>
> Interestingly enough, I'm using JFS on a new scsi disk with Mandrake
> 9.1 and was having similar problems.  I was generating heavy disk
> usage through database and astronomical data reductions.  My machine
> (dual AMD) would suddenly hang.  No new jobs would run, just
> increase the load, until I reboot the machine.
>
> I solved my problems by creating a 128Mb ram disk (using EXT2) for
> the temp data produced by my reduction runs.
>
> I believe JFS was to blame, not hardware, but you never know...

Interesting.

The concurrent factors that "consistently" came together when this
happened were thus:

 1.  Heavy DB updates taking place on JFS filesystems;

 2.  SMP (we suspected Xeon hyperthreading as a possible factor, but
     shut it off and still saw the same problem...)

 3.  The third factor, which appeared to be a catalyst, was copying,
     via scp, a file > 2GB in size onto the system.

The third piece was a particularly interesting aspect; the file would
get copied over successfully, and the scp process would hang (to the
point of "kill -9" being unable to touch it) immediately thereafter.

At that point, processes on the system that were accessing files on
the hung-up filesystem were locked, also unkillable by "kill -9."
That's certainly consistent with JFS being at the root of the problem,
whether it was the cause or not...
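
For anyone wanting to check for the same symptom: a process that
ignores "kill -9" is usually in uninterruptible sleep, which Linux
reports as state "D" in /proc/<pid>/stat.  What follows is only a
rough sketch, not something that was run on the machines described
above; the function names are made up for illustration, and it assumes
a Linux /proc layout.

```python
import os


def parse_stat_line(line):
    """Return (pid, command, state) from the text of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the first "(" and the last ")".
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    pid = int(line[:open_paren])
    command = line[open_paren + 1:close_paren]
    state = line[close_paren + 1:].split()[0]
    return pid, command, state


def uninterruptible_processes(proc_root="/proc"):
    """List (pid, command) for every process currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, command, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-scan, or unreadable entry
        if state == "D":
            found.append((pid, command))
    return found


if __name__ == "__main__":
    for pid, command in uninterruptible_processes():
        print(pid, command)
```

A process showing up here briefly is normal under heavy I/O; what
would match the hang described above is the same pids staying in the
list across repeated runs.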
--
let name="cbbrowne" and tld="libertyrms.info" in String.concat "@"
[name;tld];;
<http://dev6.int.libertyrms.com/>
Christopher Browne
(416) 646 3304 x124 (land)
