Re: Analysis of ganged WAL writes - Mailing list pgsql-hackers
From | Hannu Krosing |
---|---|
Subject | Re: Analysis of ganged WAL writes |
Date | |
Msg-id | 1034017213.2562.45.camel@rh72.home.ee |
In response to | Re: Analysis of ganged WAL writes (Tom Lane <tgl@sss.pgh.pa.us>) |
List | pgsql-hackers |
On Tue, 2002-10-08 at 01:27, Tom Lane wrote:
>
> The scheme we now have (with my recent patch) essentially says that the
> commit delay seen by any one transaction is at most two disk rotations.
> Unfortunately it's also at least one rotation :-(, except in the case
> where there is no contention, ie, no already-scheduled WAL write when
> the transaction reaches the commit stage. It would be nice to be able
> to say "at most one disk rotation" instead --- but I don't see how to
> do that in the absence of detailed information about disk head position.
>
> Something I was toying with this afternoon: assume we have a background
> process responsible for all WAL writes --- not only filled buffers, but
> the currently active buffer. It periodically checks to see if there
> are unwritten commit records in the active buffer, and if so schedules
> a write for them. If this could be done during each disk rotation,
> "just before" the disk reaches the active WAL log block, we'd have an
> ideal solution. And it would not be too hard for such a process to
> determine the right time: it could measure the drive rotational speed
> by observing the completion times of successive writes to the same
> sector, and it wouldn't take much logic to empirically find the latest
> time at which a write can be issued and have a good probability of
> hitting the disk on time. (At least, this would work pretty well given
> a dedicated WAL drive, else there'd be too much interference from other
> I/O requests.)
>
> However, this whole scheme falls down on the same problem we've run into
> before: user processes can't schedule themselves with millisecond
> accuracy. The writer process might be able to determine the ideal time
> to wake up and make the check, but it can't get the Unix kernel to
> dispatch it then, at least not on most Unixen. The typical scheduling
> slop is one time slice, which is comparable to if not more than the
> disk rotation time.

The standard for Linux has been a 100 Hz time slice, but it has been
configurable for some time. The latest RedHat (8.0) is built with 500 Hz,
which makes about 4 slices/rev for 7200 rpm disks (2 for 15000 rpm).

> ISTM aio_write only improves the picture if there's some magic in-kernel
> processing that makes this same kind of judgment as to when to issue the
> "ganged" write for real, and is able to do it on time because it's in
> the kernel. I haven't heard anything to make me think that that feature
> actually exists. AFAIK the kernel isn't much more enlightened about
> physical head positions than we are.

At least for open source kernels it could be possible to

1. write a patch to the kernel, or

2. get the authors of kernel aio interested in doing it, or

3. use some real-time (RT) OS or mixed RT/conventional OS where some
   threads can be scheduled for hard-RT. In an RT OS you are supposed to
   be able to do exactly what you describe.

I think that 2 and 3 could be "outsourced" (the respective developers
talked into supporting it), as both KAIO and RT Linuxen/BSDs are probably
also interested in high-profile applications, so they could boast that
"using our stuff enabled the PostgreSQL database to run twice as fast".

Anyway, getting to near-hardware speeds for a database will need more
specific support from the OS than web browsing or compiling.

---------------
Hannu
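
For reference, the slices-per-revolution figures quoted in the reply can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function name `slices_per_revolution` is made up for illustration, and it assumes one scheduler time slice lasts exactly 1/Hz seconds and that the disk spins at its nominal rpm.

```python
def slices_per_revolution(rpm: float, scheduler_hz: float) -> float:
    """Scheduler time slices that fit into one disk revolution.

    One revolution takes 60/rpm seconds and one slice takes
    1/scheduler_hz seconds, so the ratio is scheduler_hz * 60 / rpm.
    """
    return scheduler_hz * 60.0 / rpm


if __name__ == "__main__":
    # Tick rates and spindle speeds mentioned in the thread.
    for hz in (100, 500):
        for rpm in (7200, 15000):
            ratio = slices_per_revolution(rpm, hz)
            print(f"{hz} Hz, {rpm} rpm: {ratio:.2f} slices/rev")
```

At 100 Hz the ratio drops below one for both disk speeds, which matches the quoted remark that the scheduling slop is comparable to, if not more than, the rotation time.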