Re: optimization by removing the file system layer? - Mailing list pgsql-general

From Giles Lean
Subject Re: optimization by removing the file system layer?
Date
Msg-id 10210.961149148@nemeton.com.au
In response to Re: optimization by removing the file system layer?  (Jurgen Defurne <defurnj@glo.be>)
List pgsql-general

> I think that the Un*x filesystem is one of the reasons that large
> database vendors rather use raw devices, than filesystem storage
> files.

This used to be the preference, back in the late 80s and possibly
early 90s.  I'm seeing a preference toward using the filesystem now,
possibly with some sort of async I/O and co-operation from the OS
filesystem about interactions with the filesystem cache.
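As a rough illustration of "co-operation from the OS" over the filesystem cache, the sketch below reads a file through the ordinary filesystem interface while passing the kernel hints about the access pattern. It assumes a platform where Python exposes os.posix_fadvise (Linux, for instance) and silently skips the hints elsewhere. The function name read_with_cache_hints and the chunk size are made up for this example; this is not how any particular database does it.

```python
import os


def read_with_cache_hints(path, chunk_size=64 * 1024):
    """Read a whole file sequentially, advising the OS about cache use.

    Hypothetical helper for illustration only. The hints are advisory:
    the kernel is free to ignore them, and the data read is the same
    with or without them.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        can_advise = hasattr(os, "posix_fadvise")
        if can_advise:
            try:
                # Length 0 means "to the end of the file".
                os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
            except OSError:
                pass  # Some filesystems reject the hint; carry on without it.

        chunks = []
        while True:
            block = os.read(fd, chunk_size)
            if not block:
                break
            chunks.append(block)

        if can_advise:
            try:
                # Tell the OS these pages need not stay in its cache.
                os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
            except OSError:
                pass
        return b"".join(chunks)
    finally:
        os.close(fd)
```

The point of the design is that the application keeps using plain files and leaves the caching policy to the OS, only nudging it where the access pattern is known in advance.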

Performance preferences don't stand still.  The hardware changes, the
software changes, the volume of data changes, and different solutions
become preferable.

> Using a raw device on the disk gives them the possibility to have
> complete control over their files, indices and objects without being
> bothered by the operating system.
>
> This speeds up things in several ways :
> - the least possible OS intervention

Not that this is especially useful, necessarily.  If the "raw" device
is in fact managed by a logical volume manager doing mirroring onto
some sort of storage array there is still plenty of OS code involved.

The cost of using a filesystem in addition may not be much, if
anything, and of course a filesystem is considerably more flexible to
administer (backup, move, resize, check integrity, etc.)

> - choose block sizes according to applications
> - reducing fragmentation
> - packing data in nearby cylinders

... but when this storage area is spread over multiple mechanisms in a
smart storage array with write caching, you've no idea what is where
anyway.  Better to let the hardware or at least the OS manage this;
there are so many levels of caching between a database and the
magnetic media that working hard to influence layout is almost
certainly a waste of time.

Kirk McKusick tells a lovely story that once upon a time it used to be
sensible to check some registers on a particular disk controller to
find out where the heads were when scheduling I/O.  Needless to say,
that is history now!

There's a considerable cost in complexity and code in using "raw"
storage too, and it's not a one-off cost: as the technologies change,
the "fast" way to do things will change and the code will have to be
updated to match.  Better to leave this to the OS vendor where
possible, and take advantage of the tuning they do.

> - Anyone other ideas -> the sky is the limit here

> It also aids portability, at least on platforms that have an
> equivalent of a raw device.

I don't understand that claim.  Not much is portable about raw
devices, and they're typically not nearly as well documented as the
filesystem interfaces.

> It is also independent of the standard implemented Un*x filesystems,
> for which you will have to pay extra if you want to take extra
> measures against power loss.

Rather, it is worse.  With a Unix filesystem you get quite
well-defined semantics about what is written when.
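A minimal sketch of what relying on those semantics can look like from an application: write the data, then call fsync() so the OS is asked to push it to stable storage before the function returns. The helper name durable_write is hypothetical, and how strong the resulting guarantee really is depends on the OS, the filesystem, and any write caching in the disks or storage array.

```python
import os


def durable_write(path, data):
    """Write data to path and ask the OS to flush it to stable storage.

    Hypothetical helper for illustration only. Returns the number of
    bytes written.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested; loop until done.
            written = os.write(fd, view)
            view = view[written:]
        # Block until the OS reports the file's data has been flushed.
        os.fsync(fd)
    finally:
        os.close(fd)
    return len(data)
```

A database would typically do this for its log before acknowledging a commit, which is the kind of "what is written when" ordering the filesystem interface lets it express.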

> The problem with e.g. e2fs, is that it is not robust enough if a CPU
> fails.

ext2fs doesn't even claim to have Unix filesystem semantics.

Regards,

Giles


