Thread: Re: [Testperf-general] Re: ExclusiveLock

Re: [Testperf-general] Re: ExclusiveLock

From
"Bort, Paul"
Date:
> From: Kenneth Marshall [mailto:ktm@is.rice.edu]
[snip]
> The simplest idea I had was to pre-layout the WAL logs in a
> contiguous fashion on the disk. Solaris has this ability given
> appropriate FS parameters and we should be able to get close on most
> other OSes. Once that has happened, use something like the FSM map to
> show the allocated blocks. The CPU can keep track of its current disk
> rotational position (approx. is okay) then when we need to write a
> WAL block start writing at the next area that the disk head will be
> sweeping. Give it a little leeway for latency in the system and we
> should be able to get very low latency for the writes. Obviously,
> there would be wasted space but you could intersperse writes to the
> granularity of space overhead that you would like to see. As far as
> implementation, I was reading an interesting article that used a
> simple theoretical model to estimate disk head position to avoid
> latency.

Ken,

That's a neat idea, but I'm not sure how much good it will do. As bad as
rotational latency is, seek time is worse. Pre-allocation isn't going to
do much for rotational latency if the heads also have to seek back to
the WAL.

OTOH, pre-allocation could help two other performance aspects of the
WAL: First, if the WAL was pre-allocated, steps could be taken (by the
operator, based on their OS) to make the space allocated to the WAL
contiguous. Statistics on how much WAL is needed in 24 hours would help
with that sizing. This would reduce seeks involved in writing the WAL
data.

The other thing it would do is reduce seeks and metadata writes involved
in extending WAL files.

All of this is moot if the WAL doesn't have its own spindle(s).

This almost leads back to the old-fashioned idea of using a raw
partition, to avoid the overhead of the OS and file structure.

Or I could be thoroughly demonstrating my complete lack of understanding
of PostgreSQL internals. :-)

Maybe I'll get a chance to try the flash drive WAL idea in the next
couple of weeks. Need to see if the hardware guys have a spare flash
drive I can abuse.

Paul
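
The pre-allocation idea above can be sketched in a few lines. This is
illustrative only: the segment size, file name, and function name are
made up for the example, and os.posix_fallocate is only available on
some platforms, so the sketch falls back to writing zeros (which is
roughly what extending a file block-by-block avoids):

```python
import os

SEG_SIZE = 16 * 1024 * 1024   # hypothetical WAL segment size


def preallocate_segment(path, size=SEG_SIZE):
    """Reserve `size` bytes up front, so later appends to the segment
    need no new block allocation or metadata updates."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
    try:
        if hasattr(os, "posix_fallocate"):
            # Ask the filesystem to allocate the blocks without
            # actually writing them (Linux and some Unixes only).
            os.posix_fallocate(fd, 0, size)
        else:
            # Portable fallback: fill the file with zeros.
            os.write(fd, b"\0" * size)
        os.fsync(fd)  # persist the allocation itself
    finally:
        os.close(fd)


preallocate_segment("/tmp/wal_seg_demo", 1024 * 1024)
print(os.path.getsize("/tmp/wal_seg_demo"))  # → 1048576
```

Note that posix_fallocate only guarantees the space is reserved, not
that it is contiguous; getting contiguity still depends on the
filesystem and its parameters, as Ken's original message says.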

Re: [Testperf-general] Re: ExclusiveLock

From
Kenneth Marshall
Date:
On Wed, Nov 24, 2004 at 11:00:30AM -0500, Bort, Paul wrote:
> > From: Kenneth Marshall [mailto:ktm@is.rice.edu]
> [snip]
> > The simplest idea I had was to pre-layout the WAL logs in a 
> > contiguous fashion
> > on the disk. Solaris has this ability given appropriate FS 
> > parameters and we
> > should be able to get close on most other OSes. Once that has 
> > happened, use
> > something like the FSM map to show the allocated blocks. The 
> > CPU can keep track
> > of its current disk rotational position (approx. is okay) 
> > then when we need to
> > write a WAL block start writing at the next area that the 
> > disk head will be
> > sweeping. Give it a little leeway for latency in the system 
> > and we should be
> > able to get very low latency for the writes. Obviously, there 
> > would be wasted
> > space but you could intersperse writes to the granularity of 
> > space overhead
> > that you would like to see. As far as implementation, I was reading an
> > interesting article that used a simple theoretical model to 
> > estimate disk head
> > position to avoid latency.
> > 
> 
> Ken, 
> 
> That's a neat idea, but I'm not sure how much good it will do. As bad as
> rotational latency is, seek time is worse. Pre-allocation isn't going to do
> much for rotational latency if the heads also have to seek back to the WAL. 
> 
> OTOH, pre-allocation could help two other performance aspects of the WAL:
> First, if the WAL was pre-allocated, steps could be taken (by the operator,
> based on their OS) to make the space allocated to the WAL contiguous.
> Statistics on how much WAL is needed in 24 hours would help with that
> sizing. This would reduce seeks involved in writing the WAL data.
> 
> The other thing it would do is reduce seeks and metadata writes involved in
> extending WAL files.
> 
> All of this is moot if the WAL doesn't have its own spindle(s).
> 
> This almost leads back to the old-fashioned idea of using a raw partition,
> to avoid the overhead of the OS and file structure. 
> 
> Or I could be thoroughly demonstrating my complete lack of understanding of
> PostgreSQL internals. :-)
> 
> Maybe I'll get a chance to try the flash drive WAL idea in the next couple
> of weeks. Need to see if the hardware guys have a spare flash drive I can
> abuse.
> 
> Paul
> 

Obviously, this whole process would be much more effective on systems with
separate WAL drives. But even on less busy systems, the lock-step of
write-a-WAL/wait-for-heads/write-a-WAL can dramatically decrease your
effective throughput to the drive. Consider the worst case: write one WAL
block to disk, then schedule another WAL block to be written. That second
block must wait one full disk rotation before it can be written. On a 10k
RPM drive, this scenario lets you log only 166 TPS, assuming no
piggy-backed syncs. Now look at the case where we can use the
preallocated WAL and write immediately. Assuming a 100% sequential disk
layout, if we can start writing within 25% of a full rotation we can now
support 664 TPS on the same hardware. Next, take a typical hard drive on
my desktop system with 150M sectors/4 heads/50000 tracks -> 3000
blocks/track, or 375 8K blocks. If we can write the next block within 10
8K blocks we can perform 6225 TPS, within 5 8K blocks = 12450 TPS, and
within 2 8K blocks = 31125 TPS. This is just a simple disk drive. As you
can see, even small improvements can make a tremendous difference in
throughput. My analysis is very simplistic, and whether we can model the
I/O quickly enough to be useful is still to be determined. Maybe someone
on the mailing list with more experience in how disk drives actually
function can provide more definitive information.
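
The arithmetic above can be reproduced with a short script. The 10k RPM
rate and the 375-blocks-per-track figure come from the message; the
model itself is the same back-of-the-envelope one, where each commit
waits some fraction of a rotation before its WAL block can be written:

```python
# Back-of-the-envelope WAL throughput model: a commit must wait until
# the head reaches the next writable block, so
#   TPS = rotations/sec divided by the fraction of a rotation waited.

RPM = 10_000
ROTATIONS_PER_SEC = RPM / 60  # ~166.7 rotations/sec


def tps(wait_fraction):
    """Commits/sec if each WAL write waits `wait_fraction` of a rotation."""
    return ROTATIONS_PER_SEC / wait_fraction


# Worst case: wait one full rotation per WAL block.
print(round(tps(1.0)))       # → 167 (the message rounds down to 166)

# Preallocated WAL, start writing within 25% of a rotation.
print(round(tps(0.25)))      # → 667 (message: 664, from the 166 base)

# Desktop drive with 375 8K blocks per track: writing within N blocks
# of the head means waiting N/375 of a rotation.
for n in (10, 5, 2):
    print(n, round(tps(n / 375)))  # → 6250, 12500, 31250
                                   # (message: 6225, 12450, 31125)
```

The small differences from the figures in the message come only from
rounding the base rate to 166 before scaling; the conclusion is the
same either way.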

Ken