Thread: File system performance and pg_xlog

File system performance and pg_xlog

From

mlw

Date:

05 May 2001, 13:07:37

A small debate started with bad performance on ReiserFS. I pondered the likely
advantages to raw device access. It also occurred to me that the FAT file system
is about as close to a managed raw device as one could get. So I did some
tests.

The hardware:

A PII system running Linux 7.0, with kernel 2.2.16-2.
256M RAM
IDE home hard disk.
Adaptec 2740 with two SCSI drives
A 9G Seagate ST19171W as /dev/sda1 mounted as /sda1
A 4G Seagate ST15150W as /dev/sdb1 mounted as /sdb1
/sda1 has an ext2 file system, and is used as "base" with a symlink.
/sdb1 is either an ext2 or FAT file system used as "pg_xlog" with a symlink.


In a clean Postgres environment, I initialized pgbench as:
./pgbench -i -s 10 -d pgbench

I used this script to produce the results:

psql -U mohawk pgbench -c "checkpoint; "
su mohawk -c "./pgbench -d pgbench -t 32 -c 1"
psql -U mohawk pgbench  -c "checkpoint; "
su mohawk -c "./pgbench -d pgbench -t 32 -c 2"
psql -U mohawk pgbench  -c "checkpoint; "
su mohawk -c "./pgbench -d pgbench -t 32 -c 3"
psql -U mohawk pgbench  -c "checkpoint; "
su mohawk -c "./pgbench -d pgbench -t 32 -c 4"
psql -U mohawk pgbench  -c "checkpoint; "
su mohawk -c "./pgbench -d pgbench -t 32 -c 5"
psql -U mohawk pgbench  -c "checkpoint; "
su mohawk -c "./pgbench -d pgbench -t 32 -c 6"
psql -U mohawk pgbench  -c "checkpoint; "
su mohawk -c "./pgbench -d pgbench -t 32 -c 7"
psql -U mohawk pgbench  -c "checkpoint; "
su mohawk -c "./pgbench -d pgbench -t 32 -c 8"

(My postgres user is "mohawk")

I had to modify xlog.c to use "rename" instead of link. And I had to explicitly
set ownership of the FAT file system to the postgres user during mount.
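
For readers wondering why such a change would be needed: FAT has no hard links, so link() fails on it. The following is only a minimal Python sketch of the general idea, not the actual xlog.c change; the function name install_segment and the file names in demo() are made up for illustration.

```python
import os
import tempfile


def install_segment(tmp_path, final_path):
    """Move a finished temporary file into place.

    Tries link() + unlink() first, which refuses to overwrite an existing
    file. Falls back to rename() where hard links are unsupported (FAT).
    """
    try:
        os.link(tmp_path, final_path)
        os.unlink(tmp_path)
        return "link"
    except FileExistsError:
        # The target already exists: report it rather than clobber it.
        raise
    except OSError:
        # Hard links not available here; rename() replaces any existing target.
        os.rename(tmp_path, final_path)
        return "rename"


def demo():
    """Install one dummy file in a scratch directory and report what happened."""
    with tempfile.TemporaryDirectory() as d:
        tmp = os.path.join(d, "xlogtemp")      # hypothetical temp name
        final = os.path.join(d, "segment")     # hypothetical final name
        with open(tmp, "wb") as f:
            f.write(b"\0" * 16)
        how = install_segment(tmp, final)
        return how, os.path.exists(final), os.path.exists(tmp)
```

The trade-off is that rename() silently replaces an existing target, while link() fails if the target is already there, so the fallback gives up that safety check.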

I ran the script twice as:

./test.sh > ext2.log

(Then rebuilt a fresh database and formatted sdb1 as fat)
./test.sh > fat.log

Here is a diff of the two runs:

--- ext2.log    Sat May  5 12:58:07 2001
+++ fat.log    Sat May  5 12:58:07 2001
@@ -5,8 +5,8 @@
 number of clients: 1
 number of transactions per client: 32
 number of transactions actually processed: 32/32
-tps = 18.697006(including connections establishing)
-tps = 19.193225(excluding connections establishing)
+tps = 37.439512(including connections establishing)
+tps = 39.710461(excluding connections establishing)
 CHECKPOINT
 pghost: (null) pgport: (null) nclients: 2 nxacts: 32 dbName:pgbench
 transaction type: TPC-B (sort of)
@@ -14,8 +14,8 @@
 number of clients: 2
 number of transactions per client: 32
 number of transactions actually processed: 64/64
-tps = 32.444226(including connections establishing)
-tps = 33.499452(excluding connections establishing)
+tps = 44.782177(including connections establishing)
+tps = 46.799328(excluding connections establishing)
 CHECKPOINT
 pghost: (null) pgport: (null) nclients: 3 nxacts: 32 dbName:pgbench
 transaction type: TPC-B (sort of)
@@ -23,8 +23,8 @@
 number of clients: 3
 number of transactions per client: 32
 number of transactions actually processed: 96/96
-tps = 43.042861(including connections establishing)
-tps = 44.816086(excluding connections establishing)
+tps = 55.416117(including connections establishing)
+tps = 58.057013(excluding connections establishing)
 CHECKPOINT
 pghost: (null) pgport: (null) nclients: 4 nxacts: 32 dbName:pgbench
 transaction type: TPC-B (sort of)
@@ -32,8 +32,8 @@
 number of clients: 4
 number of transactions per client: 32
 number of transactions actually processed: 128/128
-tps = 46.033959(including connections establishing)
-tps = 47.681683(excluding connections establishing)
+tps = 61.752368(including connections establishing)
+tps = 64.796970(excluding connections establishing)
 CHECKPOINT
 pghost: (null) pgport: (null) nclients: 5 nxacts: 32 dbName:pgbench
 transaction type: TPC-B (sort of)
@@ -41,8 +41,8 @@
 number of clients: 5
 number of transactions per client: 32
 number of transactions actually processed: 160/160
-tps = 49.980258(including connections establishing)
-tps = 51.874653(excluding connections establishing)
+tps = 63.124090(including connections establishing)
+tps = 67.225563(excluding connections establishing)
 CHECKPOINT
 pghost: (null) pgport: (null) nclients: 6 nxacts: 32 dbName:pgbench
 transaction type: TPC-B (sort of)
@@ -50,8 +50,8 @@
 number of clients: 6
 number of transactions per client: 32
 number of transactions actually processed: 192/192
-tps = 51.800192(including connections establishing)
-tps = 53.752739(excluding connections establishing)
+tps = 65.452545(including connections establishing)
+tps = 68.741933(excluding connections establishing)
 CHECKPOINT
 pghost: (null) pgport: (null) nclients: 7 nxacts: 32 dbName:pgbench
 transaction type: TPC-B (sort of)
@@ -59,8 +59,8 @@
 number of clients: 7
 number of transactions per client: 32
 number of transactions actually processed: 224/224
-tps = 52.652660(including connections establishing)
-tps = 54.616802(excluding connections establishing)
+tps = 66.525419(including connections establishing)
+tps = 69.727409(excluding connections establishing)
 CHECKPOINT
 pghost: (null) pgport: (null) nclients: 8 nxacts: 32 dbName:pgbench
 transaction type: TPC-B (sort of)
@@ -68,5 +68,5 @@
 number of clients: 8
 number of transactions per client: 32
 number of transactions actually processed: 256/256
-tps = 55.440884(including connections establishing)
-tps = 57.525931(excluding connections establishing)
+tps = 67.331052(including connections establishing)
+tps = 70.575482(excluding connections establishing)
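
The relative gain is easier to see as a percentage per client count. A small Python sketch: the tps values are the "excluding connections establishing" figures copied from the diff above, and the helper name speedup_percent is just illustrative.

```python
# tps excluding connection setup, for 1..8 clients, copied from the diff above
ext2 = [19.193225, 33.499452, 44.816086, 47.681683,
        51.874653, 53.752739, 54.616802, 57.525931]
fat = [39.710461, 46.799328, 58.057013, 64.796970,
       67.225563, 68.741933, 69.727409, 70.575482]


def speedup_percent(base, new):
    """Percentage by which `new` exceeds `base`."""
    return (new / base - 1.0) * 100.0


for clients, (e, f) in enumerate(zip(ext2, fat), start=1):
    gain = speedup_percent(e, f)
    print(f"{clients} client(s): ext2 {e:6.2f} tps, FAT {f:6.2f} tps, {gain:+6.1f}%")
```

On these numbers the single-client run roughly doubles, while the runs with two or more clients gain somewhere between roughly 20% and 40%. Each run is only 32 transactions per client, so the figures are indicative at best.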

Re: File system performance and pg_xlog

From

Marko Kreen

Date:

05 May 2001, 16:00:09

On Sat, May 05, 2001 at 01:09:38PM -0400, mlw wrote:
> A small debate started with bad performance on ReiserFS. I pondered the likely
> advantages to raw device access. It also occurred to me that the FAT file system
> is about as close to a managed raw device as one could get. So I did some
> tests:

> /sdb1 is either an ext2 or FAT file system used as "pg_xlog" with a symlink.

One little thought: does mounting ext2 with 'noatime' make any
difference?  AFAIK FAT has no concept of atime, so wouldn't that
be a fairer comparison?  Just a thought.

-- 
marko

Re: File system performance and pg_xlog

From

mlw

Date:

05 May 2001, 18:40:04

Marko Kreen wrote:
> 
> On Sat, May 05, 2001 at 01:09:38PM -0400, mlw wrote:
> > A small debate started with bad performance on ReiserFS. I pondered the likely
> > advantages to raw device access. It also occurred to me that the FAT file system
> > is about as close to a managed raw device as one could get. So I did some
> > tests:
> 
> > /sdb1 is either an ext2 or FAT file system used as "pg_xlog" with a symlink.
> 
> One little thought: does mounting ext2 with 'noatime' make any
> difference?  AFAIK FAT has no concept of atime, so wouldn't that
> be a fairer comparison?  Just a thought.
> 
> --
> marko

I don't know, and I haven't tried that, but I suspect that it won't make much
difference. 

While I do not think that anyone would seriously consider using FAT for xlog
in a production environment (I'd have problems considering it myself), the
numbers do say something about the nature of WAL. A bunch of files, all the
same size, is practically what FAT does best. Plus there is no real overhead.

The very reasons why FAT is a POS file system are the same reasons it would
work great for WAL, with the only caveat being that fsync is implemented, and
the application (postgres) maintains its own data integrity.

Oddly enough, I did not see any performance improvement using FAT for the
"base" directory. That may be the nature of the pg block size vs cluster size,
fragmentation, and stuff. If I get some time I will investigate it a bit more.

Clearly not everyone would be interested in this. PG seems to be used for
everything from a small personal db, to a system component db -- like on a web
box, to a full blown stand-alone server. The first two applications may not be
interested in this sort of stuff, but the last category, the "full blown server"
would certainly want to squeeze as much out of their system as possible.

I think a "pgfs" could easily be a derivative of FAT, or even FAT with some
Ioctls.  It is simple, it is fast, it does not attempt to do things postgres
doesn't need.

Re: File system performance and pg_xlog

From

Marko Kreen

Date:

05 May 2001, 20:26:26

On Sat, May 05, 2001 at 06:43:51PM -0400, mlw wrote:
> Marko Kreen wrote:
> > On Sat, May 05, 2001 at 01:09:38PM -0400, mlw wrote:
> > > A small debate started with bad performance on ReiserFS. I pondered the likely
> > > advantages to raw device access. It also occurred to me that the FAT file system
> > > is about as close to a managed raw device as one could get. So I did some
> > > tests:

> I think a "pgfs" could easily be a derivative of FAT, or even FAT with some
> Ioctls.  It is simple, it is fast, it does not attempt to do things postgres
> doesn't need.

Well, my opinion too is that it is a waste of resources to try to
implement a PostgreSQL-specific filesystem.  As you have already
shown that there are noticeable differences between filesystems,
the Right Thing would be to make a FAQ/web-page/knowledge-base
of comments on different filesystems from the point of view of a
DB (PostgreSQL) server.

Also, users will have different priorities
(reliability/speed-of-reads/speed-of-writes), and different
users will order them differently.  So it should mention that
this fs is good for this but bad at that, that it is good
to put this part of the db on this fs but not that part, and
suggest mount flags to use...

There already exist bazillion filesystems, _some_ of them should
be usable for PostgreSQL too :)

Besides resource waste there are other problems with an app-level
fs:

* double-buffering, and the incompatibilities in avoiding it
* a lot of code that already exists in today's OS'es would have to be reimplemented
* you lose all of the UNIX user-space tools
* the speed difference will not be very big.  Remember: it _was_ big on OS'es
  and fs' in year 1990.  Today's fs are a lot better and there should be an
  os/fs combo that is 95% perfect.

-- 
marko

Utilizing "direct writes" Re: File system performance and pg_xlog

From

Alfred Perlstein

Date:

05 May 2001, 22:01:50

* Marko Kreen <marko@l-t.ee> [010505 17:39] wrote:
> 
> There already exist bazillion filesystems, _some_ of them should
> be usable for PostgreSQL too :)
> 
> Besides resource waste there are others problems with app-level
> fs:
> 
> * double-buffering and incompatibilities of avoiding that

Depends on the OS.  Most operating systems, like FreeBSD and Solaris,
offer character device access, which means that the OS will DMA
directly from the process's address space.  Avoiding the double
copy is trivial except that one must align and size writes correctly,
generally on 512 byte boundaries and in 512 byte increments.
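
The alignment and sizing rule can be sketched in a few lines of Python. This is only an illustration: the 512-byte sector size is the figure mentioned above and varies by device, and the helper names are hypothetical. An anonymous mmap is used because it hands back page-aligned memory, which also satisfies 512-byte alignment.

```python
import mmap

SECTOR = 512  # assumed sector size; real devices may require a larger unit


def align_up(n, alignment=SECTOR):
    """Round n up to the next multiple of alignment."""
    return (n + alignment - 1) // alignment * alignment


def aligned_buffer(payload, alignment=SECTOR):
    """Copy payload into a page-aligned, zero-padded buffer.

    The buffer length is a whole number of sectors, so it can be handed
    to a raw-device write without a partial trailing sector.
    """
    size = max(align_up(len(payload), alignment), alignment)
    buf = mmap.mmap(-1, size)          # anonymous mapping, zero-filled
    buf[:len(payload)] = payload
    return buf
```

For example, a 700-byte record ends up in a 1024-byte buffer, and the write offset on the device would likewise have to be a multiple of the sector size. In C the aligned memory would typically come from posix_memalign().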

> * a lot of code should be reimplemented that already exists
>   in today's OS'es

That's true.

> * you lose all of UNIX user-space tools

Even worse. :)

> * the speed difference will not be very big.  Remember: it _was_
>   big on OS'es and fs' in year 1990.  Today's fs are a lot
>   better and there should be a os/fs combo that is 95% perfect.

Well, here's an idea, has anyone tried using the "direct write"
interface that some OS's offer?  I doubt FreeBSD does, but I'm
positive that Solaris offers it as well as possibly IRIX.

-- 
-Alfred Perlstein - [alfred@freebsd.org]
Daemon News Magazine in your snail-mail! http://magazine.daemonnews.org/

Re: File system performance and pg_xlog

From

mlw

Date:

05 May 2001, 22:06:50

Marko Kreen wrote:
> 
> On Sat, May 05, 2001 at 06:43:51PM -0400, mlw wrote:
> > Marko Kreen wrote:
> > > On Sat, May 05, 2001 at 01:09:38PM -0400, mlw wrote:
> > > > A small debate started with bad performance on ReiserFS. I pondered the likely
> > > > advantages to raw device access. It also occurred to me that the FAT file system
> > > > is about as close to a managed raw device as one could get. So I did some
> > > > tests:
> 
> > I think a "pgfs" could easily be a derivative of FAT, or even FAT with some
> > Ioctls.  It is simple, it is fast, it does not attempt to do things postgres
> > doesn't need.
> 
> Well, my opinion too is that it is waste of resources to try
> implement PostgreSQL-specific filesystem.  As you already showed
> that there are noticeable differences of different filesystems,
> the Right Thing would be to make a FAQ/web-page/knowledge-base
> of comments on different filesystem in point of view of DB
> (PostgreSQL) server.
> 
> Also users will have different priorities:
> reliability/speed-of-reads/speed-of-writes - I mean different
> users have them ordered differently - so it should be mentioned
> this fs is good for this but bad on this, etc...  It is good
> to put this part of db on this fs but not that part of db...
> Suggestions on mount flags to use...

I think it is a simpler problem than that. Postgres, with fsync enabled, does a
lot of work trying to maintain data integrity. It is logical to conclude that a
file system that does as little as possible would almost always perform better.
Regardless of what the file system does, eventually it writes blocks of data to
sectors on a disk.

Many databases use their own data volume management. I am not suggesting that
anyone create a new file system, but after performing some tests, I am really
starting to see why products like Oracle manage their own table spaces.

If one looks at the FAT file system with an open mind and a clear understanding
of how it will be used, some small modifications may make it the functional
equivalent of a managed table space volume, at least under Linux.

Some of the benchmark numbers are hovering around 20% improvement! That's
nothing to sneeze at. I have a database loader that does a select nextval(..)
followed by a begin, a series of inserts, followed by a commit.

With xlog on a FAT file system, I can get 53-60 sets per second. With Xlog
sitting on ext2, I can get 40-45 sets per second. (Of the same data) These are
not insignificant improvements, and should be examined. If not from a Postgres
development perspective, at least from a deployment perspective.

> 
> There already exist bazillion filesystems, _some_ of them should
> be usable for PostgreSQL too :)

I agree.

-- 
I'm not offering myself as an example; every life evolves by its own laws.
------------------------
http://www.mohawksoft.com

Re: Utilizing "direct writes" Re: File system performance and pg_xlog

From

Marko Kreen

Date:

06 May 2001, 06:40:53

On Sat, May 05, 2001 at 07:01:35PM -0700, Alfred Perlstein wrote:
> * Marko Kreen <marko@l-t.ee> [010505 17:39] wrote:
> > * double-buffering and incompatibilities of avoiding that
> 
> Depends on the OS, most Operating systems like FreeBSD and Solaris
> offer character device access, this means that the OS will DMA
> directly from the process's address space.  Avoiding the double
> copy is trivial except that one must align and size writes correctly,
> generally on 512 byte boundaries and in 512 byte increments.

PostgreSQL would then also have to think very hard about write
ordering; atm this is the OS's business.

> > * the speed difference will not be very big.  Remember: it _was_
> >   big on OS'es and fs' in year 1990.  Today's fs are a lot
> >   better and there should be a os/fs combo that is 95% perfect.
> 
> Well, here's an idea, has anyone tried using the "direct write"
> interface that some OS's offer?  I doubt FreeBSD does, but I'm
> positive that Solaris offers it as well as possibly IRIX.

And how much does it differ from using FAT?  That's the point I
want to make.  There should already be a fs that is 90% close
to that.

-- 
marko

Re: File system performance and pg_xlog

From

Marko Kreen

Date:

06 May 2001, 06:41:02

On Sat, May 05, 2001 at 10:10:33PM -0400, mlw wrote:
> I think it is simpler problem than that. Postgres, with fsync enabled, does a
> lot of work trying to maintain data integrity. It is logical to conclude that a
> file system that does as little as possible would almost always perform better.
> Regardless of what the file system does, eventually it writes blocks of data to
> sectors on a disk.

But there's more: when PostgreSQL today 'uses a fs' it also gets
all the caching/optimizing algorithms in the OS kernel 'for free'.

> Many databases use their own data volume management. I am not suggesting that
> anyone create a new file system, but after performing some tests, I am really
> starting to see why products like oracle manage their own table spaces.
> 
> If one looks at the FAT file system with an open mind and a clear understanding
> of how it will be used, some small modifications may make it the functional
> equivalent of a managed table space volume, at least under Linux.

Are you talking about a new in-kernel fs?  Let's see, how many
OS'es does PostgreSQL support today?

> With xlog on a FAT file system, I can get 53-60 sets per second. With Xlog
> sitting on ext2, I can get 40-45 sets per second. (Of the same data) These are
> not insignificant improvements, and should be examined. If not from a Postgres
> development perspective, at least from a deployment perspective.

Yes, therefore I proposed a 'knowledge-base' where such things
could be mentioned.

-- 
marko

Re: Utilizing "direct writes" Re: File system performance and pg_xlog

From

Alfred Perlstein

Date:

06 May 2001, 11:38:24

* Marko Kreen <marko@l-t.ee> [010506 03:33] wrote:
> On Sat, May 05, 2001 at 07:01:35PM -0700, Alfred Perlstein wrote:
> > * Marko Kreen <marko@l-t.ee> [010505 17:39] wrote:
> > > * double-buffering and incompatibilities of avoiding that
> > 
> > Depends on the OS, most Operating systems like FreeBSD and Solaris
> > offer character device access, this means that the OS will DMA
> > directly from the process's address space.  Avoiding the double
> > copy is trivial except that one must align and size writes correctly,
> > generally on 512 byte boundaries and in 512 byte increments.
> 
> PostgreSQL must then also think about write ordering very hard,
> atm this OS business.

Depends. :)

> 
> > > * the speed difference will not be very big.  Remember: it _was_
> > >   big on OS'es and fs' in year 1990.  Today's fs are a lot
> > >   better and there should be a os/fs combo that is 95% perfect.
> > 
> > Well, here's an idea, has anyone tried using the "direct write"
> > interface that some OS's offer?  I doubt FreeBSD does, but I'm
> > positive that Solaris offers it as well as possibly IRIX.
> 
> And how much it differs from using FAT?  Thats the point I
> want to make.  There should be already a fs that is 90% close
> that.

Using FAT is totally up to the vendor's FAT implementation.
Solaris FAT will cache data for a file as long as it's open
which sort of defeats the purpose.  Maybe Linux's caching
methods are less effective or have less overhead making FAT
under Linux a win.

One of the problems is that I don't think most vendors consider
their FAT implementation to be "mission critical", so it's possible
that bugs may be present.

Does anyone have that test suite that was just mentioned for
benching Postgresql?  (I'd like to try FreeBSD FAT).

-- 
-Alfred Perlstein - [alfred@freebsd.org]
Instead of asking why a piece of software is using "1970s technology,"
start asking why software is ignoring 30 years of accumulated wisdom.

Re: File system performance and pg_xlog (More info)

From

mlw

Date:

06 May 2001, 16:42:28

Well, as my tests continue, and I try to understand the nature of how file
system design affects Postgres, I did notice something disturbing.

On a single processor machine, Linux kernel 2.2.x, and a good Adaptec SCSI
system and disks, the results were a clear win. When you put WAL on FAT32 on
its own disk, it ranges between 10% and 20% improvement.

My other machine is semi-production, so I don't want to screw too much with
the OS, layout, etc. It has a Paradise ATA-66 and two ATA-100 disks, which
perform quite well, usually. It is an SMP PIII 600, 512M RAM.

Using FAT32 was horrible, one tenth the performance of ext2. Perhaps this is
because FAT has one HUGE spinlock, whereas ext2 has a better granularity? I
don't know, maybe I will get off my butt and examine the code later.

One thing is perfectly clear, file systems have a huge impact. While it may not
be an argument for writing a "pgfs," it is a clear indicator that optimal
performance is non-trivial and requires a bit of screwing around and
understanding what's best.

Personally, I would fear a "pgfs." Writing a kernel component would be a bad
idea. FAT has potential, but I don't think kernel developers put any serious
thought into it, so I don't think it is a mission critical component in most
cases. Just the behavior that I saw with FAT on SMP Linux, tells me to be
careful.

Postgres is at the mercy of the file systems, WAL makes it even more so. My gut
tells me that this aspect of the project will refuse to be taken lightly.


-- 
I'm not offering myself as an example; every life evolves by its own laws.
------------------------
http://www.mohawksoft.com

Re: File system performance and pg_xlog

From

teg@redhat.com (Trond Eivind Glomsrød)

Date:

07 May 2001, 11:31:51

Marko Kreen <marko@l-t.ee> writes:

> On Sat, May 05, 2001 at 10:10:33PM -0400, mlw wrote:
> > I think it is simpler problem than that. Postgres, with fsync enabled, does a
> > lot of work trying to maintain data integrity. It is logical to conclude that a
> > file system that does as little as possible would almost always perform better.
> > Regardless of what the file system does, eventually it writes blocks of data to
> > sectors on a disk.
> 
> But there's more, when PostgreSQL today 'uses a fs' it also get
> all the caching/optimizing algorithms in os kernel 'for free'.
> 
> > Many databases use their own data volume management. I am not suggesting that
> > anyone create a new file system, but after performing some tests, I am really
> > starting to see why products like oracle manage their own table spaces.
> > 
> > If one looks at the FAT file system with an open mind and a clear understanding
> > of how it will be used, some small modifications may make it the functional
> > equivalent of a managed table space volume, at least under Linux.
> 
> Are you talking about new in-kernel fs?  Lets see, how many
> os'es PostgreSQL today supports?

If you're using raw devices on Linux and get a win there, it's a win
for Postgresql on Linux. This is important for everyone using it on
this platform (probably a big chunk of the users). And who uses all
the new features and performance enhancements done in other ways?

It all comes down to if it actually would give a performance boost,
how much work it is and if someone wants to do it.

-- 
Trond Eivind Glomsrød
Red Hat, Inc.

Re: File system performance and pg_xlog

From

Bruce Momjian

Date:

07 May 2001, 12:33:28

> If one looks at the FAT file system with an open mind and a clear understanding
> of how it will be used, some small modifications may make it the functional
> equivalent of a managed table space volume, at least under Linux.

Can I ask if we are talking FAT16 (DOS) or FAT32 (NT)?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: File system performance and pg_xlog

From

Bruce Momjian

Date:

07 May 2001, 12:36:36

One big performance issue is that PostgreSQL 7.1 uses fdatasync if it is
available.  However, according to RedHat, 2.2 Linux kernels have
fdatasync, but it really just acts as fsync.  In 2.4 kernels, fdatasync
is really fdatasync, I think.

That is a major issue for people running performance tests.  For
example, XFS may be slow on 2.2 kernels but not 2.4 kernels.
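
One way to check what a given kernel and file system actually do is to time the two calls side by side. A rough Python sketch, with hypothetical names and a made-up workload: it pre-fills a scratch file so that the timed writes do not change the file size, which is the case where fdatasync is allowed to skip the metadata update that fsync must perform.

```python
import os
import tempfile
import time

# Not every platform exposes fdatasync; fall back to fsync where it is missing.
fdatasync = getattr(os, "fdatasync", os.fsync)


def time_overwrites(sync_fn, blocks=16, block_size=8192):
    """Overwrite `blocks` blocks of a pre-filled file, syncing after each one.

    Returns the elapsed wall-clock time in seconds.
    """
    fd, path = tempfile.mkstemp()
    try:
        zero = b"\0" * block_size
        for _ in range(blocks):            # pre-fill so the size stays fixed
            os.write(fd, zero)
        os.fsync(fd)
        os.lseek(fd, 0, os.SEEK_SET)

        payload = b"\x01" * block_size
        start = time.perf_counter()
        for _ in range(blocks):
            os.write(fd, payload)
            sync_fn(fd)
        return time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)
```

Comparing time_overwrites(os.fsync) with time_overwrites(fdatasync) on the file system in question gives a first hint. If the two are indistinguishable, fdatasync may simply be behaving as fsync there, though drive write caches and timing noise can hide a real difference.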

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: File system performance and pg_xlog

From

Tom Lane

Date:

07 May 2001, 12:45:01

teg@redhat.com (Trond Eivind Glomsrød) writes:
> If you're using raw devices on Linux and get a win there, it's a win
> for Postgresql on Linux. ...
> It all comes down to if it actually would give a performance boost,
> how much work it is and if someone wants to do it.

No, those are not the only considerations.  If the feature is not
portable then we also have to consider how much of a headache it'll be
to maintain in parallel with a more portable approach.  We might reject
such a feature even if it's a clear win for Linux, if it creates enough
problems elsewhere.  Postgres is *not* a Linux-only application, and I
trust it never will be.
        regards, tom lane

PS: that's not meant to reject the idea out-of-hand; perhaps the
benefits will prove to be so large that we will want to do it
anyway.  I'm just trying to counter what appears to be a narrowly
platform-centric view of the issues.

Re: File system performance and pg_xlog

From

teg@redhat.com (Trond Eivind Glomsrød)

Date:

07 May 2001, 13:16:03

Tom Lane <tgl@sss.pgh.pa.us> writes:

> teg@redhat.com (Trond Eivind Glomsrød) writes:
> > If you're using raw devices on Linux and get a win there, it's a win
> > for Postgresql on Linux. ...
> > It all comes down to if it actually would give a performance boost,
> > how much work it is and if someone wants to do it.
> 
> No, those are not the only considerations.  If the feature is not
> portable then we also have to consider how much of a headache it'll be
> to maintain in parallel with a more portable approach.

Cleanliness and code quality are obvious requirements.

> We might reject such a feature even if it's a clear win for Linux,
> if it creates enough problems elsewhere.  Postgres is *not* a
> Linux-only application, and I trust it never will be.

No, but if a Linux-specific approach gives a 100% performance boost,
it's probably worth doing. At 1% it probably isn't. Same goes for
FreeBSD and others.

-- 
Trond Eivind Glomsrød
Red Hat, Inc.

Re: File system performance and pg_xlog

From

teg@redhat.com (Trond Eivind Glomsrød)

Date:

07 May 2001, 13:31:43

Bruce Momjian <pgman@candle.pha.pa.us> writes:

> That is a major issue for people running performance tests.  For
> example, XFS may be slow on 2.2 kernels but not 2.4 kernels.

XFS is 2.4 only, AFAIK - even the installer modifications SGI did to
Red Hat Linux 7 (which is shipped with a 2.2 kernel) include
installing a 2.4pre kernel, AFAIR.

-- 
Trond Eivind Glomsrød
Red Hat, Inc.

Re: File system performance and pg_xlog

From

Marko Kreen

Date:

07 May 2001, 13:53:05

On Mon, May 07, 2001 at 12:12:43PM -0400, Bruce Momjian wrote:
> > If one looks at the FAT file system with an open mind and a clear understanding
> > of how it will be used, some small modifications may make it the functional
> > equivalent of a managed table space volume, at least under Linux.
> 
> Can I ask if we are talking FAT16 (DOS) or FAT32 (NT)?

Does not matter.  The architecture is the same.  FAT16 is not DOS-only,
and FAT32 is not NT-only.  And there is VFAT16 and VFAT32...

Point 1 in this discussion seems to be that storing WAL
files on a FAT-like fs is better (less overhead) than on an
ext2/ufs-like fs.

Point 2: as vendors do not think of FAT as a critical fs, it is
probably not very optimised for things like SMP; nor for reliability
(this probably comes from the FAT design itself (that's why it
probably has less overhead too...)).

Point 3: as FAT-like fs's are probably least-overhead
fs's, could we get any better with a pgfs implementation?

Conclusion: ?

-- 
marko

Re: Re: File system performance and pg_xlog (More info)

From

Bruce Momjian

Date:

07 May 2001, 14:39:49

> Personally, I would fear a "pgfs." Writing a kernel component would be a bad
> idea. FAT has potential, but I don't think kernel developers put any serious
> thought into it, so I don't think it is a mission critical component in most
> cases. Just the behavior that I saw with FAT on SMP Linux, tells me to be
> careful.
> 
> Postgres is at the mercy of the file systems, WAL make it even more so. My gut
> tells me that this aspect of the project will refuse to be taken lightly.

From a portability standpoint, I think if we go anywhere, it would be to
write directly into device files representing sections of a disk.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: Re: File system performance and pg_xlog (More info)

From

Doug McNaught

Date:

07 May 2001, 15:16:57

Bruce Momjian <pgman@candle.pha.pa.us> writes:

> From a portability standpoint, I think if we go anywhere, it would be to
> write directly into device files representing sections of a disk.

That makes sense to me.  On "traditional" Unices, we could use the raw 
character device for a partition (eg /dev/rdsk/* on Solaris), and on
Linux we'd use /dev/raw*, which is a mapping to a specific partition
established before PG startup. 

I guess there would need to be a system table that keeps track of 
(dev, offset, size) tuples for each WAL file.
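
The bookkeeping described here might look roughly like the following Python sketch. Everything in it is an assumption made for illustration: the 16 MB segment size, the device path, the WAL file names and the function names are not taken from the PostgreSQL source.

```python
from dataclasses import dataclass

SEGMENT_SIZE = 16 * 1024 * 1024  # assumed size of one WAL file, in bytes


@dataclass(frozen=True)
class Extent:
    dev: str     # raw device path
    offset: int  # byte offset of the extent on the device
    size: int    # length of the extent in bytes


def carve_extents(dev, dev_size, segment_size=SEGMENT_SIZE):
    """Split a device of dev_size bytes into back-to-back extents."""
    last_start = dev_size - segment_size
    return [Extent(dev, off, segment_size)
            for off in range(0, last_start + 1, segment_size)]


def assign(wal_names, extents):
    """Map each WAL file name to its own extent; fail if there are too few."""
    if len(wal_names) > len(extents):
        raise ValueError("not enough extents for all WAL files")
    return dict(zip(wal_names, extents))
```

With equally sized WAL files the table stays trivial: a 100 MB partition yields six 16 MB extents, and looking up a file name gives the device and offset to write to.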

-Doug
-- 
The rain man gave me two cures; he said jump right in,
The first was Texas medicine--the second was just railroad gin,
And like a fool I mixed them, and it strangled up my mind,
Now people just get uglier, and I got no sense of time...          --Dylan

Re: Re: File system performance and pg_xlog (More info)

From

Steve Wampler

Date:

07 May 2001, 16:49:43

Doug McNaught wrote:
> 
> That makes sense to me.  On "traditional" Unices, we could use the raw
> character device for a partition (eg /dev/rdsk/* on Solaris), and on
> Linux we'd use /dev/raw*, which is a mapping to a specific partition
> established before PG startup.

Small update - newer Linux kernels now support multiple raw devices
through /dev/raw/raw*, though the mapping between raw (character)
and block devices has to be recreated on each boot.

--
Steve Wampler-  SOLIS Project, National Solar Observatory
swampler@noao.edu

Re: File system performance and pg_xlog (More info)

From

mlw

Date:

07 May 2001, 18:13:16

Steve Wampler wrote:
> 
> Doug McNaught wrote:
> >
> > That makes sense to me.  On "traditional" Unices, we could use the raw
> > character device for a partition (eg /dev/rdsk/* on Solaris), and on
> > Linux we'd use /dev/raw*, which is a mapping to a specific partition
> > established before PG startup.
> 
> Small update - newer Linux kernels now support multiple raw devices
> through /dev/raw/raw*, though the mapping between raw (character)
> and block devices has to be recreated on each boot.

It would be very easy to do a lot of experimenting, and perhaps even more
efficient in the long run if we could:

First, pre-allocate table spaces: rather than only letting a table file grow,
why not allow pre-allocated table files? I want xyz table to be 2.2G long. That
way a file system doesn't care about the periphery of a file.

ALTER [TABLE | INDEX] name PREALLOCATE nn BLOCKS;

Vacuuming and space reuse would be an issue. You would probably have to
implement a defragment routine, or some sort of free block list.
After the preallocated limit is hit, grow the file normally.
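
What preallocation would mean at the file level can be sketched in Python as follows. This is an illustration only, with hypothetical function names, and it assumes 8192-byte blocks (PostgreSQL's default block size). It writes real zero-filled blocks rather than creating a sparse file, so that the file system actually allocates the space up front.

```python
import os
import tempfile

BLOCK_SIZE = 8192  # assumed block size in bytes


def preallocate(path, nblocks, block_size=BLOCK_SIZE):
    """Grow path to at least nblocks blocks by appending zero-filled data.

    Never shrinks an existing file. Returns the resulting size in bytes.
    """
    current = os.path.getsize(path) if os.path.exists(path) else 0
    target = nblocks * block_size
    with open(path, "ab") as f:
        while current < target:
            chunk = min(block_size, target - current)
            f.write(b"\0" * chunk)
            current += chunk
        f.flush()
        os.fsync(f.fileno())   # make sure the allocation reaches the disk
    return os.path.getsize(path)


def demo():
    """Preallocate a scratch file twice and return both resulting sizes."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "table_file")   # hypothetical file name
        first = preallocate(path, 4)
        second = preallocate(path, 2)          # smaller request: size unchanged
        return first, second
```

Once the preallocated size is reached, ordinary appends take over, which matches the "grow the file normally" behaviour described above.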

Second, allow tables and indexes to be created with arbitrary file names,
something like:

create table foo (this integer, that varchar) as file '/path/file';
create index foo_ndx on foo (this) as file '/path2/file1';

If you do not specify a file, then it behaves as before.

I suspect that these sorts of modifications are either easy or hard. There
never is a middle ground on changes like this. The file name one is probably
easier than the preallocated block one.

-- 
I'm not offering myself as an example; every life evolves by its own laws.
------------------------
http://www.mohawksoft.com

Re: File system performance and pg_xlog

From

"Mark L. Woodward"

Date:

11 May 2001, 09:16:31

Bruce Momjian wrote:

> > If one looks at the FAT file system with an open mind and a clear understanding
> > of how it will be used, some small modifications may make it the functional
> > equivalent of a managed table space volume, at least under Linux.
>
> Can I ask if we are talking FAT16 (DOS) or FAT32 (NT)

I used FAT32 in my tests.

On a side note, FAT32 is actually DOS. It showed up in Windows 95b and wasn't
supported in NT until Win2K.

I guess what I have been trying to say is that we all know it all comes down to
disk I/O at some point. Reducing the number of sequential disk I/O operations for
each transaction will improve performance. Maybe choosing a simple file system will
accomplish this.