Re: 8.0.X and the ARC patent - Mailing list pgsql-hackers
From | Simon Riggs
---|---
Subject | Re: 8.0.X and the ARC patent
Date |
Msg-id | 1109964746.6117.92.camel@localhost.localdomain
In response to | Re: 8.0.X and the ARC patent (Tom Lane <tgl@sss.pgh.pa.us>)
Responses | Re: 8.0.X and the ARC patent, Re: 8.0.X and the ARC patent
List | pgsql-hackers
On Wed, 2005-03-02 at 20:55 -0500, Tom Lane wrote:
> Michael Adler <adler@pobox.com> writes:
> > Looking at the "Response Time Charts"
>
> > 8.0.1/ARC
> > http://www.osdl.org/projects/dbt2dev/results/dev4-010/309/rt.html
>
> > 20050301 with 2Q patch
> > http://www.osdl.org/projects/dbt2dev/results/dev4-010/313/rt.html
>
> > It seems like the average response time has gone down, but the worse
> > case ceiling has raised about 35%.
>
> The worst cases are associated with checkpoints. I'm not sure why a
> checkpoint would have a greater effect on the 2Q system than an ARC
> system --- checkpoint doesn't request any new buffers so you'd think
> it'd be independent. Maybe this says that the bgwriter is less
> effective with 2Q, so that there are more dirty buffers remaining to
> be written at the checkpoint? But why?

The pattern seems familiar. Reduced average response time increases total throughput, which on this test means we have more dirty buffers to write at checkpoint time. I would not necessarily suspect 2Q over ARC, at least initially. The pattern of behaviour is similar across ARC, 2Q and Clock, though the checkpoint spikes differ in intensity. The latter makes me suspect BufMgrLock contention or something similar.

There is a two-level effect at checkpoint time: first we have the write from PostgreSQL buffers to the OS cache, then we have the write from the OS cache to disk by the pdflush daemons. At this point I'm not certain whether the delay is caused by the checkpointing or by pdflush. Mark and I had discussed some investigations around that. This behaviour is new in the 2.6 kernel, so it is possible there is an unpleasant interaction there, though I do not wish to cast random blame.

Checkpoint doesn't request new buffers, but it does require the BufMgrLock in order to write all of the dirty buffers. It could be that the I/Os go straight to the OS cache, so that the tight loop writing out dirty buffers causes such an extreme backlog on the BufMgrLock that it takes more than a minute to clear and return to normal contention.

It could also be that at checkpoint time the number of writes exceeds dirty_ratio, and the kernel forces the checkpoint process to bypass the cache and the pdflush daemons altogether and perform the I/O itself. Single-threaded, that would display the scalability profile we see.

Some kernel-level questions in there...

There is no documented event-state model for LWLock acquisition, so it is possible that there is a complex bottleneck in amongst them.

Amdahl's Law tells me that looking at the checkpoints is the next best action for tuning, since they add considerably to the average response time. Looking at the oprofile for the run as a whole misses the delayed-transaction behaviour that occurs during checkpoints.

I would like to try to catch an oprofile of the system while it performs a checkpoint, as a way to give us some clues. Perhaps that could be achieved by forcing a manual checkpoint as superuser, and making that interaction cause a switch to a new oprofile output file.

Best Regards,
Simon Riggs
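[Editor's sketch] One way to probe the dirty_ratio hypothesis above is simply to watch the kernel's writeback counters while a checkpoint runs. The following is a minimal illustrative script, not from the thread; it assumes a Linux 2.6-style /proc layout (the /proc/sys/vm/dirty_ratio and /proc/meminfo files are standard, but the one-second sampling interval and 60-second duration are arbitrary choices).

```python
#!/usr/bin/env python
"""Sample kernel writeback state around a checkpoint (illustrative sketch)."""
import time


def read_dirty_ratio():
    # vm.dirty_ratio: the percentage of memory at which a writing process
    # is forced to perform synchronous writeback itself.
    with open("/proc/sys/vm/dirty_ratio") as f:
        return int(f.read().strip())


def read_meminfo_kb(keys=("Dirty", "Writeback")):
    # Pull the Dirty and Writeback counters (in kB) from /proc/meminfo.
    values = {}
    with open("/proc/meminfo") as f:
        for line in f:
            name, rest = line.split(":", 1)
            if name in keys:
                values[name] = int(rest.strip().split()[0])
    return values


if __name__ == "__main__":
    print("vm.dirty_ratio = %d%%" % read_dirty_ratio())
    # Sample once per second for a minute, timed to span a checkpoint.
    for _ in range(60):
        sample = read_meminfo_kb()
        print("Dirty: %(Dirty)d kB  Writeback: %(Writeback)d kB" % sample)
        time.sleep(1)
```

If Dirty climbs toward the dirty_ratio threshold during the checkpoint, that would lend weight to the idea that the checkpoint process is being forced into synchronous, single-threaded writeback.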
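[Editor's sketch] The oprofile capture proposed in the last paragraph could be approximated along these lines. This is an assumption-laden illustration rather than the method described in the thread: it presumes oprofile's legacy opcontrol/opreport tools are installed and already configured, that the script runs as root, and that a local "postgres" superuser can issue CHECKPOINT via psql.

```python
#!/usr/bin/env python
"""Capture an oprofile sample spanning a manual CHECKPOINT (illustrative sketch)."""
import subprocess


def run(cmd):
    # Echo and execute a command, raising on failure.
    print("$ " + " ".join(cmd))
    subprocess.check_call(cmd)


if __name__ == "__main__":
    # Start from a clean sample buffer so the report covers only the checkpoint.
    run(["opcontrol", "--reset"])
    run(["opcontrol", "--start"])

    # Force a checkpoint; the CHECKPOINT command requires superuser privileges.
    run(["psql", "-U", "postgres", "-c", "CHECKPOINT;"])

    # Flush the collected samples and stop profiling.
    run(["opcontrol", "--dump"])
    run(["opcontrol", "--stop"])

    # Symbol-level report; it can be restricted to the postgres binary, e.g.
    #   opreport -l /usr/local/pgsql/bin/postgres
    run(["opreport", "-l"])
```

Because opcontrol is reset immediately before the CHECKPOINT and dumped immediately after, the resulting profile approximates the "new oprofile output file per checkpoint" idea without modifying PostgreSQL itself.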