Re: [Testperf-general] Re: ExclusiveLock - Mailing list pgsql-hackers

From: Kenneth Marshall
Subject: Re: [Testperf-general] Re: ExclusiveLock
Msg-id: 20041124164353.GA895@it.is.rice.edu
In response to: Re: [Testperf-general] Re: ExclusiveLock ("Bort, Paul" <pbort@tmwsystems.com>)
List: pgsql-hackers
On Wed, Nov 24, 2004 at 11:00:30AM -0500, Bort, Paul wrote:
> > From: Kenneth Marshall [mailto:ktm@is.rice.edu]
> [snip]
> > The simplest idea I had was to pre-layout the WAL logs in a
> > contiguous fashion on the disk. Solaris has this ability given
> > appropriate FS parameters, and we should be able to get close on
> > most other OSes. Once that has happened, use something like the FSM
> > map to show the allocated blocks. The CPU can keep track of its
> > current disk rotational position (approximate is okay); then, when
> > we need to write a WAL block, start writing at the next area that
> > the disk head will be sweeping. Give it a little leeway for latency
> > in the system and we should be able to get very low latency for the
> > writes. Obviously, there would be wasted space, but you could
> > intersperse writes to the granularity of space overhead that you
> > would like to see. As far as implementation, I was reading an
> > interesting article that used a simple theoretical model to
> > estimate disk head position to avoid latency.
> > 
> 
> Ken, 
> 
> That's a neat idea, but I'm not sure how much good it will do. As bad as
> rotational latency is, seek time is worse. Pre-allocation isn't going to do
> much for rotational latency if the heads also have to seek back to the WAL. 
> 
> OTOH, pre-allocation could help two other performance aspects of the WAL:
> First, if the WAL was pre-allocated, steps could be taken (by the operator,
> based on their OS) to make the space allocated to the WAL contiguous.
> Statistics on how much WAL is needed in 24 hours would help with that
> sizing. This would reduce seeks involved in writing the WAL data.
> 
> The other thing it would do is reduce seeks and metadata writes involved in
> extending WAL files.
> 
> All of this is moot if the WAL doesn't have its own spindle(s).
> 
> This almost leads back to the old-fashioned idea of using a raw partition,
> to avoid the overhead of the OS and file structure. 
> 
> Or I could be thoroughly demonstrating my complete lack of understanding of
> PostgreSQL internals. :-)
> 
> Maybe I'll get a chance to try the flash drive WAL idea in the next couple
> of weeks. Need to see if the hardware guys have a spare flash drive I can
> abuse.
> 
> Paul
> 

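For concreteness, here is a rough sketch of the scheme quoted above: a
preallocated, contiguous WAL area, an FSM-like free map over its blocks,
and a crude constant-angular-velocity guess at the head position used to
pick the next block to write. Every name and constant below is made up for
illustration; none of this exists in PostgreSQL, and the rotational model
is only the simple approximation the quoted text allows for.

/*
 * Sketch only: preallocated WAL blocks, a free map over them, and a
 * constant-angular-velocity estimate of the head position.  All names
 * and constants are hypothetical; this is not PostgreSQL code.
 */
#include <math.h>
#include <stdint.h>
#include <stdio.h>

#define RPM              10000	/* assumed 10k RPM spindle */
#define BLOCKS_PER_TRACK 375	/* assumed 8K blocks per track */
#define MARGIN_BLOCKS    2	/* "a little leeway" for scheduling latency */

static const double rotation_us = 60.0 * 1000000.0 / RPM;	/* 6000 us */

/*
 * Estimate which block is under the head "now_us" microseconds after a
 * reference instant at which block 0 was under the head.  Approximate,
 * exactly as the proposal allows.
 */
static int
estimate_head_block(double now_us)
{
	double		frac = fmod(now_us, rotation_us) / rotation_us;

	return (int) (frac * BLOCKS_PER_TRACK) % BLOCKS_PER_TRACK;
}

/*
 * Pick the next WAL block to write: the first free block at least
 * MARGIN_BLOCKS ahead of the estimated head position.  free_map stands
 * in for the FSM-like map of the preallocated area.
 */
static int
choose_wal_block(double now_us, const uint8_t *free_map)
{
	int			head = estimate_head_block(now_us);
	int			i;

	for (i = 0; i < BLOCKS_PER_TRACK; i++)
	{
		int			candidate = (head + MARGIN_BLOCKS + i) % BLOCKS_PER_TRACK;

		if (free_map[candidate])
			return candidate;
	}
	return -1;					/* track exhausted; move to the next track */
}

int
main(void)
{
	uint8_t		free_map[BLOCKS_PER_TRACK];
	int			i;

	for (i = 0; i < BLOCKS_PER_TRACK; i++)
		free_map[i] = 1;		/* whole preallocated track is free */

	/* Pretend 1.7 ms have passed since block 0 was under the head. */
	printf("head near block %d, write block %d\n",
		   estimate_head_block(1700.0), choose_wal_block(1700.0, free_map));
	return 0;
}

A real drive complicates this considerably (zoned recording means sectors
per track vary across the platter, and command queueing and write caching
hide the true head position), which is part of why any such estimate can
only ever be approximate.
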
Obviously, this whole process would be much more effective on systems with
separate WAL drives. But even on less busy systems, the lock-step of
write-a-WAL-block/wait-for-the-heads/write-a-WAL-block can dramatically
decrease your effective throughput to the drive.

Consider the worst case: write one WAL block to disk, then schedule another
WAL block to be written. That second block has to wait one full disk
rotation before it can be written. On a 10k RPM drive, this scenario lets
you log about 166 TPS, assuming no piggy-backed syncs. Now look at the case
where we can use the preallocated WAL and write immediately: assuming a
100% sequential disk layout, if we can start writing within 25% of a full
rotation we can support 664 TPS on the same hardware.

Now take a typical hard drive on my desktop system, with 150M sectors,
4 heads, and 50000 tracks -> 3000 blocks/track, or 375 8K blocks. If we can
write the next block within 10 8K blocks of the head, we can perform
6225 TPS; within 5 8K blocks, 12450 TPS; within 2 8K blocks, 31125 TPS.
That is just a simple desktop drive. As you can see, even small
improvements in write placement can make a tremendous difference in
throughput.

My analysis is very simplistic, and whether we can model the I/O quickly
enough to be useful is still to be determined. Maybe someone on the mailing
list with more experience in how disk drives actually function can provide
more definitive information.
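
For reference, the figures above fall out of a one-line model:
TPS ~= rotations/sec * blocks_per_track / blocks_the_head_must_pass. The
small throw-away program below reproduces them under the same assumptions
(10000 RPM rounded down to 166 rotations/sec as in the text, 375 8K blocks
per track, one WAL write per transaction, no piggy-backed syncs). It is
illustrative arithmetic only, not PostgreSQL code.

/*
 * Back-of-the-envelope arithmetic for the TPS figures in the text.
 * Assumptions: 10k RPM (rounded to 166 rotations/sec), 375 8K blocks
 * per track, one WAL write per transaction, no piggy-backed syncs.
 */
#include <stdio.h>

int
main(void)
{
	double		rot_per_sec = 166.0;	/* 10000 RPM / 60, rounded down */
	double		blocks_per_track = 375.0;

	/* Blocks that pass under the head before the write can start. */
	double		wait_blocks[] = {375.0, 375.0 * 0.25, 10.0, 5.0, 2.0};
	const char *label[] = {
		"full rotation (worst case)",
		"25% of a rotation",
		"within 10 8K blocks",
		"within 5 8K blocks",
		"within 2 8K blocks"
	};
	int			i;

	for (i = 0; i < 5; i++)
	{
		/* Each transaction waits wait_blocks/blocks_per_track of a rotation. */
		double		tps = rot_per_sec * blocks_per_track / wait_blocks[i];

		printf("%-28s %6.0f TPS\n", label[i], tps);
	}
	return 0;
}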

Ken


