Re: Improvement of checkpoint IO scheduler for stable transaction responses - Mailing list pgsql-hackers

From: Heikki Linnakangas
Subject: Re: Improvement of checkpoint IO scheduler for stable transaction responses
Msg-id: 51C9D03B.10107@vmware.com
In response to: Re: Improvement of checkpoint IO scheduler for stable transaction responses (KONDO Mitsumasa <kondo.mitsumasa@lab.ntt.co.jp>)
List: pgsql-hackers
On 21.06.2013 11:29, KONDO Mitsumasa wrote:
> I took results of my separate patches and original PG.
>
> * Result of DBT-2
>                 |     TPS    90%tile    Average    Maximum
> ----------------+------------------------------------------
>  original_0.7   | 3474.62  18.348328    5.739    36.977713
>  original_1.0   | 3469.03  18.637865    5.842    41.754421
>  fsync          | 3525.03  13.872711    5.382    28.062947
>  write          | 3465.96  19.653667    5.804    40.664066
>  fsync + write  | 3564.94  16.31922     5.1      34.530766
>
> - 'original_*' is unpatched PG 9.2.4; the suffix is the
> checkpoint_completion_target value.
> - In the patched runs, checkpoint_completion_target is set to 0.7.
> - 'write' has the write patch applied, and 'fsync' has the fsync patch applied.
> - 'fsync + write' has both patches applied.
>
>
> * Investigation of result
> - A large checkpoint_completion_target, both in original PG and with the
> write patch, leads to slow transaction latency in the benchmark, because
> slowly written pages cause long fsync IO at the end of the checkpoint.
> - The fsync patch affects the latency of each file's fsync. Back-to-back
> fsyncs of the files cause slow latency, so it is good for latency if the
> fsync stage of the checkpoint sleeps after a slow fsync IO.
> - The fsync + write patches together seemed to improve TPS. I think the
> write patch disturbs transactions doing full-page-write WAL writes less
> than original (plain) PG does.

Hmm, so the write patch doesn't do much, but the fsync patch makes the
response times somewhat smoother. I'd suggest that we drop the write
patch for now, and focus on the fsyncs.

What checkpointer_fsync_delay_ratio and
checkpointer_fsync_delay_threshold settings did you use with the fsync
patch? It's disabled by default.

This is the interesting part of the patch:

> @@ -1171,6 +1174,20 @@ mdsync(void)
>                                  FilePathName(seg->mdfd_vfd),
>                                  (double) elapsed / 1000);
>
> +                        /*
> +                         * If this fsync has long time, we sleep 'fsync-time * checkpoint_fsync_delay_ratio'
> +                         * for giving priority to executing transaction.
> +                         */
> +                        if( CheckPointerFsyncDelayThreshold >= 0 &&
> +                            !shutdown_requested &&
> +                            !ImmediateCheckpointRequested() &&
> +                            (elapsed / 1000 > CheckPointerFsyncDelayThreshold))
> +                        {
> +                            pg_usleep((elapsed / 1000) * CheckPointerFsyncDelayRatio * 1000L);
> +                            if(log_checkpoints)
> +                                elog(DEBUG1, "checkpoint sync sleep: time=%.3f msec",
> +                                     (double) (elapsed / 1000) * CheckPointerFsyncDelayRatio);
> +                        }
>                         break;  /* out of retry loop */
>                     }
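
So if I'm reading this right, with checkpointer_fsync_delay_threshold = 1000
(milliseconds) and checkpointer_fsync_delay_ratio = 1.0, an fsync that takes 2
seconds would be followed by a 2 second sleep, while an fsync that takes 500 ms
would not be followed by any sleep at all.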

I'm not sure it's a good idea to sleep proportionally to the time it
took to complete the previous fsync. If you have a 1GB cache in the RAID
controller, fsyncing a 1GB segment will fill it up. But since it
fits in cache, it will return immediately. So we proceed fsyncing other
files, until the cache is full and the fsync blocks. But once we fill up
the cache, it's likely that we're hurting concurrent queries. ISTM it
would be better to stay under that threshold, keeping the I/O system
busy, but never filling up the cache completely.

This is just a theory, though. I don't have a good grasp on how the OS
and a typical RAID controller behaves under these conditions.

I'd suggest that we just sleep for a small fixed amount of time between
every fsync, unless we're running behind the checkpoint schedule. And
for a first approximation, let's just assume that the fsync phase is e.g.
10% of the whole checkpoint work.
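
To illustrate the shape of what I have in mind, here's a rough sketch (this is
not the attached patch). CheckpointFsyncDelay and FSYNC_PHASE_SHARE are made-up
names, and IsCheckpointOnSchedule() is currently static in checkpointer.c, so
mdsync() couldn't call it directly without some extra plumbing:

/*
 * Sketch: a fixed sleep between fsyncs during a checkpoint, skipped when
 * the checkpoint is running behind schedule.  The fsync phase is assumed
 * to account for the last 10% of the checkpoint's progress, the write
 * phase for the first 90%.
 */
#define FSYNC_DELAY_MS      100     /* fixed sleep between fsyncs */
#define FSYNC_PHASE_SHARE   0.1     /* assumed share of total checkpoint work */

static void
CheckpointFsyncDelay(int flags, double fsync_progress)
{
    /* Map progress within the fsync phase (0.0 - 1.0) onto overall progress. */
    double      progress = (1.0 - FSYNC_PHASE_SHARE) +
                           FSYNC_PHASE_SHARE * fsync_progress;

    /* No sleeping during an immediate checkpoint, or when behind schedule. */
    if ((flags & CHECKPOINT_IMMEDIATE) || !IsCheckpointOnSchedule(progress))
        return;

    pg_usleep(FSYNC_DELAY_MS * 1000L);  /* milliseconds -> microseconds */
}

mdsync() would call this after each file it fsyncs, with fsync_progress set to
the fraction of pending files synced so far, analogous to how the write phase
calls CheckpointWriteDelay() today.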

> I will send you a more detailed investigation and results next week. And I
> will also take results with pgbench. If you care about other parts of the
> benchmark results or other postgres parameters, please tell me.

Attached is a quick patch to implement a fixed, 100ms delay between
fsyncs, and the assumption that the fsync phase is 10% of the total
checkpoint duration. I suspect 100ms is too small to have much effect,
but that happens to be what we have currently in CheckpointWriteDelay().
Could you test this patch along with yours? If you can, test with
different delays (e.g. 100ms, 500ms and 1000ms) and different ratios
between the write and fsync phases (e.g. 0.5, 0.7, 0.9), to get an idea
of how sensitive the test case is to those settings.

- Heikki

