Re: [Testperf-general] Re: ExclusiveLock - Mailing list pgsql-hackers
From | Kenneth Marshall |
---|---|
Subject | Re: [Testperf-general] Re: ExclusiveLock |
Date | |
Msg-id | 20041124164353.GA895@it.is.rice.edu |
In response to | Re: [Testperf-general] Re: ExclusiveLock ("Bort, Paul" <pbort@tmwsystems.com>) |
List | pgsql-hackers |
On Wed, Nov 24, 2004 at 11:00:30AM -0500, Bort, Paul wrote:
> > From: Kenneth Marshall [mailto:ktm@is.rice.edu]
> [snip]
> > The simplest idea I had was to pre-layout the WAL logs in a
> > contiguous fashion on the disk. Solaris has this ability given
> > appropriate FS parameters and we should be able to get close on
> > most other OSes. Once that has happened, use something like the
> > FSM map to show the allocated blocks. The CPU can keep track of
> > its current disk rotational position (approx. is okay) then when
> > we need to write a WAL block start writing at the next area that
> > the disk head will be sweeping. Give it a little leeway for
> > latency in the system and we should be able to get very low
> > latency for the writes. Obviously, there would be wasted space
> > but you could intersperse writes to the granularity of space
> > overhead that you would like to see. As far as implementation, I
> > was reading an interesting article that used a simple theoretical
> > model to estimate disk head position to avoid latency.
> >
>
> Ken,
>
> That's a neat idea, but I'm not sure how much good it will do. As bad
> as rotational latency is, seek time is worse. Pre-allocation isn't
> going to do much for rotational latency if the heads also have to
> seek back to the WAL.
>
> OTOH, pre-allocation could help two other performance aspects of the
> WAL: First, if the WAL was pre-allocated, steps could be taken (by
> the operator, based on their OS) to make the space allocated to the
> WAL contiguous. Statistics on how much WAL is needed in 24 hours
> would help with that sizing. This would reduce seeks involved in
> writing the WAL data.
>
> The other thing it would do is reduce seeks and metadata writes
> involved in extending WAL files.
>
> All of this is moot if the WAL doesn't have its own spindle(s).
>
> This almost leads back to the old-fashioned idea of using a raw
> partition, to avoid the overhead of the OS and file structure.
>
> Or I could be thoroughly demonstrating my complete lack of
> understanding of PostgreSQL internals. :-)
>
> Maybe I'll get a chance to try the flash drive WAL idea in the next
> couple of weeks. Need to see if the hardware guys have a spare flash
> drive I can abuse.
>
> Paul
>

Obviously, this whole process would be much more effective on systems
with separate WAL drives. But even on less busy systems, the lock-step
of write-a-WAL/wait-for-heads/write-a-WAL can dramatically decrease
your effective throughput to the drive.

For example, the worst case would be to write one WAL block to disk,
then schedule another WAL block to be written to disk. This block will
need to wait for 1 full disk rotation to perform the write. On a 10k
drive, you will be able to log 166 TPS in this scenario, assuming no
piggy-backed syncs.

Now look at the case where we can use the preallocated WAL and write
immediately. Assuming a 100% sequential disk layout, if we can start
writing within 25% of the full rotation we can now support 664 TPS on
the same hardware.

Now look at a typical hard drive on my desktop system with 150M
sectors/4 heads/50000 tracks -> 3000 blocks/track or 375 8K blocks. If
we can write the next block within 10 8K blocks we can perform 6225
TPS, within 5 8K blocks = 12450 TPS, within 2 8K blocks = 31125 TPS.
This is just on a simple disk drive.

As you can see, even small improvements can make a tremendous
difference in throughput. My analysis is very simplistic and whether we
can model the I/O quickly enough to be useful is still to be
determined. Maybe someone on the mailing list with more experience in
how disk drives actually function can provide more definitive
information.

Ken
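
The throughput figures in the message can be reproduced with a short sketch. It assumes, as the message does, one commit per WAL write, a 10k RPM drive, and 375 8K blocks per track; the function names are illustrative only, and the base rate is truncated to 166 rotations per second to match the message (the untruncated value, about 166.67, gives slightly higher results).

```python
# Rough model of commit throughput limited by rotational latency.
# Assumptions (from the message, not measured): 10k RPM drive,
# 375 8K blocks per track, one WAL block written per commit.

RPM = 10_000
BLOCKS_PER_TRACK = 375


def base_tps(rpm: int) -> int:
    """Worst case: every write waits one full rotation.

    Truncated to an integer, as in the message (10000 / 60 -> 166).
    """
    return rpm // 60


def tps_with_wait(rpm: int, wait_fraction: float) -> float:
    """TPS if each write waits only `wait_fraction` of a rotation."""
    return base_tps(rpm) / wait_fraction


def tps_within_blocks(rpm: int, blocks_per_track: int, gap_blocks: int) -> float:
    """TPS if the next write can start within `gap_blocks` blocks."""
    return base_tps(rpm) * blocks_per_track / gap_blocks


if __name__ == "__main__":
    print("full rotation:", base_tps(RPM))
    print("25% of a rotation:", tps_with_wait(RPM, 0.25))
    for gap in (10, 5, 2):
        print(f"within {gap} blocks:", tps_within_blocks(RPM, BLOCKS_PER_TRACK, gap))
```

The model ignores seek time, controller overhead, and piggy-backed syncs, so it is an upper bound on what the rotational-position idea could gain rather than a prediction.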