AW: WAL and raw devices (was: volume management) - Mailing list pgsql-hackers

From Zeugswetter Andreas SB
Subject AW: WAL and raw devices (was: volume management)
Msg-id 11C1E6749A55D411A9670001FA6879633682C5@sdexcsrv1.f000.d0188.sd.spardat.at
List pgsql-hackers
> > As an aside, I do however think, that optimizing the O_SYNC path of
> > the WAL code to block writes to larger blocks (doesn't need to be
> > more than 256k) would lead to nearly the same performance as a raw
> > device on most filesystems. (Maybe also add code to reuse backed up
> > logfiles to avoid the need to preallocate space) Imho this is the part
> > of the code where the brainwork should first be put into. It is also a
> > prerequisite to make raw devices fast, since if you write 8k blocks to
> > a raw device, that is slow (not faster than a fs).
> 
> You cannot block writes to the WAL without blocking transactions waiting 
> on the write, because completion of that write is necessary for the 
> transaction to complete.

Yes, this is obvious, but:

You *can* batch writes into larger blocks as long as no commit comes
in between. This substantially increases performance, e.g. for bulk loads
where a single transaction produces more than 8k of WAL. A typical example is even in the
regression test: the "copy ... from" statements. They really suffer from
the O_SYNC mode. This mode is essentially what you would have now for a
raw device WAL.
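
To make the batching idea concrete, here is a minimal sketch in C. It is not the
actual WAL code; the names WalBuf, wal_open, wal_append and wal_commit, and the
fixed 256k buffer, are assumptions made up for illustration. Records are
collected in memory and written only when the buffer is full or a commit
arrives, so a transaction larger than 8k goes out in a few large O_SYNC writes.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define WAL_BUF_SIZE (256 * 1024)  /* assumed batch size (the 256k figure) */

typedef struct WalBuf {
    int    fd;        /* log file, opened with O_SYNC */
    size_t used;      /* bytes buffered but not yet written */
    long   nflushes;  /* number of batched flushes issued so far */
    char   data[WAL_BUF_SIZE];
} WalBuf;

/* Create/truncate the log file in O_SYNC mode; returns 0 on success. */
int wal_open(WalBuf *w, const char *path)
{
    w->fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0600);
    w->used = 0;
    w->nflushes = 0;
    return w->fd >= 0 ? 0 : -1;
}

/* Write out everything buffered so far, retrying on short writes. */
int wal_flush(WalBuf *w)
{
    size_t off = 0;

    assert(w->used <= WAL_BUF_SIZE);
    if (w->used == 0)
        return 0;               /* nothing pending, no write call */
    while (off < w->used) {
        ssize_t n = write(w->fd, w->data + off, w->used - off);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        off += (size_t) n;
    }
    w->used = 0;
    w->nflushes++;
    return 0;
}

/* Buffer one log record; the file is written only when the buffer fills. */
int wal_append(WalBuf *w, const void *rec, size_t len)
{
    const char *p = rec;

    while (len > 0) {
        size_t room = WAL_BUF_SIZE - w->used;
        size_t n = len < room ? len : room;

        memcpy(w->data + w->used, p, n);
        w->used += n;
        p += n;
        len -= n;
        if (w->used == WAL_BUF_SIZE && wal_flush(w) != 0)
            return -1;
    }
    return 0;
}

/* A commit must not return before its records have been written. */
int wal_commit(WalBuf *w)
{
    return wal_flush(w);
}
```

With this, 64 records of 8k result in two 256k writes instead of 64 small ones,
while a small transaction still gets its own write at commit time.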

> Moving the WAL volume's disk head into position is the major investment 
> you are amortizing with your large blocks.   If the head is already in 
> position, it is about as efficient to write a little as to write a lot.

This is only half of the story for large transactions. For large transactions
you need to write more than the current 8k in one call (only in raw device
or O_SYNC mode, of course). Writing in large blocks also helps the fs reduce
head movement: after every write call the OS suspends the current
process and makes room for another backend to, e.g., read a block on the same drive,
thus forcing head movement.

I suggest you do some tests with raw devices, as I already have, to see what happens
if you only write 8k blocks (you get only 50-60% of the performance of 256k writes).

The IO performance gain you can achieve on a raw device compared to a
preallocated filesystem file is imho negligible. On AIX, for example, what difference
there is comes from a global kernel parameter that defaults to a max 32k block size
for read ahead and write behind.
I noted the advantages in a previous thread about why Oracle wants raw devices,
and I don't think they are worth it at the current state of PostgreSQL.

Andreas