Re: Improvement of checkpoint IO scheduler for stable transaction responses - Mailing list pgsql-hackers

From: Heikki Linnakangas
Subject: Re: Improvement of checkpoint IO scheduler for stable transaction responses
Msg-id: 51C9D03B.10107@vmware.com
In response to: Re: Improvement of checkpoint IO scheduler for stable transaction responses (KONDO Mitsumasa <kondo.mitsumasa@lab.ntt.co.jp>)
Responses: Re: Improvement of checkpoint IO scheduler for stable transaction responses
List: pgsql-hackers
On 21.06.2013 11:29, KONDO Mitsumasa wrote:
> I took results of my separate patches and original PG.
>
> * Result of DBT-2
>               | TPS     90%tile   Average Maximum
> ------------------------------------------------------
> original_0.7  | 3474.62 18.348328 5.739   36.977713
> original_1.0  | 3469.03 18.637865 5.842   41.754421
> fsync         | 3525.03 13.872711 5.382   28.062947
> write         | 3465.96 19.653667 5.804   40.664066
> fsync + write | 3564.94 16.31922  5.1     34.530766
>
> - 'original_*' is unpatched PG 9.2.4; the suffix is the
>   checkpoint_completion_target setting.
> - In the other patched servers, checkpoint_completion_target is set
>   to 0.7.
> - 'write' has the write patch applied, and 'fsync' the fsync patch.
> - 'fsync + write' has both patches applied.
>
> * Investigation of result
> - A large checkpoint_completion_target, in original PG and with the
>   write patch, leads to slow latency in benchmark transactions,
>   because slowly written pages cause long fsync I/O at the end of
>   the checkpoint.
> - The fsync patch affects the latency of each file's fsync.
>   Back-to-back fsyncs of the files cause slow latency, so it is good
>   for latency if the fsync stage of the checkpoint sleeps after a
>   slow fsync I/O.
> - The fsync + write patches together seemed to improve TPS. I think
>   the write patch disturbs transactions doing full-page-write WAL
>   writes less than original (plain) PG does.

Hmm, so the write patch doesn't do much, but the fsync patch makes the
response times somewhat smoother. I'd suggest that we drop the write
patch for now, and focus on the fsyncs.

What checkpointer_fsync_delay_ratio and
checkpointer_fsync_delay_threshold settings did you use with the fsync
patch? It's disabled by default.

This is the interesting part of the patch:

> @@ -1171,6 +1174,20 @@ mdsync(void)
>  						FilePathName(seg->mdfd_vfd),
>  						(double) elapsed / 1000);
>
> +				/*
> +				 * If this fsync took a long time, sleep for
> +				 * 'fsync-time * checkpoint_fsync_delay_ratio' to give
> +				 * priority to executing transactions.
> +				 */
> +				if (CheckPointerFsyncDelayThreshold >= 0 &&
> +					!shutdown_requested &&
> +					!ImmediateCheckpointRequested() &&
> +					(elapsed / 1000 > CheckPointerFsyncDelayThreshold))
> +				{
> +					pg_usleep((elapsed / 1000) * CheckPointerFsyncDelayRatio * 1000L);
> +					if (log_checkpoints)
> +						elog(DEBUG1, "checkpoint sync sleep: time=%.3f msec",
> +							 (double) (elapsed / 1000) * CheckPointerFsyncDelayRatio);
> +				}
>  				break;	/* out of retry loop */
>  			}

I'm not sure it's a good idea to sleep proportionally to the time it
took to complete the previous fsync. If you have a 1GB cache in the
RAID controller, fsyncing a 1GB segment will fill it up. But since it
fits in the cache, the fsync will return immediately. So we proceed to
fsync other files, until the cache is full and an fsync finally
blocks. But once we have filled up the cache, it's likely that we're
hurting concurrent queries. ISTM it would be better to stay under that
threshold, keeping the I/O system busy but never filling up the cache
completely. This is just a theory, though; I don't have a good grasp
of how the OS and a typical RAID controller behave under these
conditions.

I'd suggest that we just sleep for a small fixed amount of time
between every fsync, unless we're running behind the checkpoint
schedule. And for a first approximation, let's just assume that the
fsync phase is e.g. 10% of the whole checkpoint work (both ideas are
sketched below).

> I will send you a more detailed investigation and results next week,
> and I will also take results with pgbench. If you are interested in
> other parts of the benchmark results or other postgres parameters,
> please tell me.
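For anyone trying to reproduce this, a hypothetical postgresql.conf
fragment for those two GUCs might look like the following. The names
come from the discussion above; the values are invented for
illustration, and the quoted code suggests the threshold is compared
against the fsync time in milliseconds, with a negative value
(apparently the default) disabling the sleep:

	# Illustrative values only -- not from the actual patch.
	# Sleep after any fsync slower than this many milliseconds;
	# a negative value (the apparent default) disables the sleep.
	checkpointer_fsync_delay_threshold = 1000
	# After such an fsync, sleep for fsync-time * ratio.
	checkpointer_fsync_delay_ratio = 1.0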
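To make the fixed-delay suggestion concrete, here is a minimal C
sketch, not the attached patch itself: a constant nap after each
fsync, skipped when the checkpoint has fallen behind schedule, with
the fsync phase assumed to be the final 10% of the checkpoint's
progress budget. FSYNC_DELAY_MS, FSYNC_PHASE_RATIO and
CheckpointFsyncDelay() are names invented here; IsCheckpointOnSchedule()
and pg_usleep() are the existing checkpointer primitives.

	#define FSYNC_DELAY_MS		100	/* fixed nap between fsyncs */
	#define FSYNC_PHASE_RATIO	0.1	/* assumed share of checkpoint work */

	static void
	CheckpointFsyncDelay(int files_synced, int files_total)
	{
		double		progress;

		/*
		 * Map fsync-phase progress onto the overall checkpoint
		 * schedule: the write phase covers 0.0 .. 0.9, the fsync
		 * phase the remaining 0.9 .. 1.0.
		 */
		progress = (1.0 - FSYNC_PHASE_RATIO) +
			FSYNC_PHASE_RATIO * (double) files_synced / files_total;

		/* Nap only while we are still ahead of schedule. */
		if (IsCheckpointOnSchedule(progress))
			pg_usleep(FSYNC_DELAY_MS * 1000L);	/* takes microseconds */
	}

Called after each file's fsync in mdsync(), something along these
lines would keep the I/O system busy without letting the fsync phase
race ahead of the checkpoint schedule.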
Attached is a quick patch to implement a fixed, 100ms delay between
fsyncs, and the assumption that the fsync phase is 10% of the total
checkpoint duration. I suspect 100ms is too small to have much effect,
but that happens to be what we currently have in
CheckpointWriteDelay(). Could you test this patch along with yours? It
would also be good to test with different delays (e.g. 100ms, 500ms
and 1000ms) and different ratios between the write and fsync phases
(e.g. 0.5, 0.7, 0.9), to get an idea of how sensitive the test case is
to those settings.

- Heikki