Re: Spread checkpoint sync - Mailing list pgsql-hackers

From Cédric Villemain
Subject Re: Spread checkpoint sync
Msg-id AANLkTimvJP5SnS6u826BUt2=NmwM+43AQQgnnKwY_yWg@mail.gmail.com
In response to Re: Spread checkpoint sync  (Greg Smith <greg@2ndquadrant.com>)
Responses Re: Spread checkpoint sync
List pgsql-hackers
2011/2/7 Greg Smith <greg@2ndquadrant.com>:
> Robert Haas wrote:
>>
>> With the fsync queue compaction patch applied, I think most of this is
>> now not needed.  Attached please find an attempt to isolate the
>> portion that looks like it might still be useful.  The basic idea of
>> what remains here is to make the background writer still do its normal
>> stuff even when it's checkpointing.  In particular, with this patch
>> applied, PG will:
>>
>> 1. Absorb fsync requests a lot more often during the sync phase.
>> 2. Still try to run the cleaning scan during the sync phase.
>> 3. Pause for 3 seconds after every fsync.
>>
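For illustration, a hedged sketch of that loop shape, not the patch's actual
code: seg_fsync() is a hypothetical stand-in for the per-segment fsync that
mdsync() performs, while AbsorbFsyncRequests(), BgBufferSync() and pg_usleep()
are the existing routines.

static void
sync_phase_sketch(int nsegments)
{
    int         i;

    for (i = 0; i < nsegments; i++)
    {
        AbsorbFsyncRequests();   /* (1) absorb queued fsync requests often */
        BgBufferSync();          /* (2) keep the cleaning scan running */
        seg_fsync(i);            /* hypothetical stand-in for the per-segment
                                  * fsync that mdsync() actually performs */
        pg_usleep(3000000L);     /* (3) pause 3 seconds after every fsync */
    }
}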
>
> Yes, the bits you extracted were the remaining useful parts from the
> original patch.  Today was quiet here because there were sports on or
> something, and I added full auto-tuning magic to the attached update.  I
> need to kick off benchmarks and report back tomorrow to see how well this
> does, but any additional patch here would only be code cleanup on the messy
> stuff I did in here (plus proper implementation of the pair of GUCs).  This
> has finally gotten to the exact logic I've been meaning to complete as
> spread sync since the idea was first postponed in 8.3, with the benefit of
> some fsync absorption improvements along the way too.
>
> The automatic timing is modeled on the existing checkpoint_completion_target
> concept, except with a new tunable (not yet added as a GUC) currently called
> CheckPointSyncTarget, set to 0.8 right now.  What I think I want to do is
> make the existing checkpoint_completion_target now be the target for the end
> of the sync phase, matching its name; people who bumped it up won't
> necessarily even have to change anything.  Then the new GUC can be
> checkpoint_write_target, representing the target that is in there right now.

Is it worth starting a new thread to cover the different IO improvements
done so far or ongoing, and how we might add new GUCs (if required!) with
some intelligence shared between those patches?  (For instance, the hint
bit IO limit probably needs a tunable defining something similar to
hint_write_completion_target and/or an IO throttling strategy ... items
which are still in gestation.)

>
> I tossed the earlier idea of counting relations to sync based on the write
> phase data as too inaccurate after testing, and with it for now goes
> checkpoint sorting.  Instead, I just take a first pass over pendingOpsTable
> to get a total number of things to sync, which will always match the real
> count barring strange circumstances (like dropping a table).
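A minimal sketch of that counting pass, assuming md.c's pendingOpsTable hash
table and its PendingOperationEntry layout; the patch's actual code may well
differ.

static int
count_pending_syncs(void)
{
    HASH_SEQ_STATUS hstat;
    PendingOperationEntry *entry;
    int         total = 0;

    hash_seq_init(&hstat, pendingOpsTable);
    while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
    {
        if (!entry->canceled)   /* requests for dropped relations don't count */
            total++;
    }
    return total;
}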
>
> As for the automatically determining the interval, I take the number of
> syncs that have finished so far, divide by the total, and get a number
> between 0.0 and 1.0 that represents progress on the sync phase.  I then use
> the same basic CheckpointWriteDelay logic that is there for spreading writes
> out, except with the new sync target.  I realized that if we assume the
> checkpoint writes should have finished in CheckPointCompletionTarget worth
> of time or segments, we can compute a new progress metric with the formula:
>
> progress = CheckPointCompletionTarget + (1.0 - CheckPointCompletionTarget) *
> finished / goal;
>
> Where "finished" is the number of segments written out, while "goal" is the
> total.  To turn this into an example, let's say the default parameters are
> set, we've finished the writes, and have completed 1 out of 4 syncs; that
> much work will be considered:
>
> progress = 0.5 + (1.0 - 0.5) * 1 / 4 = 0.625
>
> On a scale that effectively aims to have the sync work finished by 0.8.
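A minimal sketch of that arithmetic, using illustrative names rather than the
patch's actual identifiers:

static double
sync_phase_progress(double write_target, int syncs_done, int syncs_total)
{
    /* write_target plays the role of CheckPointCompletionTarget here */
    if (syncs_total <= 0)
        return 1.0;             /* nothing to sync, phase is done */
    return write_target +
        (1.0 - write_target) * (double) syncs_done / syncs_total;
}

/* The example above: sync_phase_progress(0.5, 1, 4) == 0.625, judged
 * against a sync target of 0.8. */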
>
> I don't use quite the same logic as the CheckpointWriteDelay though.  It
> turns out the existing checkpoint_completion implementation doesn't always
> work like I thought it did, which provides some very interesting insight into
> why my attempts to work around checkpoint problems haven't worked as well as
> expected the last few years.  I thought that what it did was wait until an
> amount of time determined by the target was reached until it did the next
> write.  That's not quite it; what it actually does is check progress against
> the target, then sleep exactly one nap interval if it is ahead of
> schedule.  That is only the same thing if you have a lot of buffers to write
> relative to the amount of time involved.  There's some alternative logic if
> you don't have bgwriter_lru_maxpages set, but in the normal situation it
> effectively means that:
>
> maximum write spread time = bgwriter_delay * checkpoint dirty blocks
>
> No matter how far apart you try to spread the checkpoints.  Now, typically,
> when people run into these checkpoint spikes in production, reducing
> shared_buffers improves that.  But I now realize that doing so will then
> reduce the average number of dirty blocks participating in the checkpoint,
> and therefore potentially pull the spread down at the same time!  Also, if
> you try and tune bgwriter_delay down to get better background cleaning,
> you're also reducing the maximum spread.  Between this issue and the bad
> behavior when the fsync queue fills, no wonder this has been so hard to tune
> out of production systems.  At some point, the reduction in spread defeats
> further attempts to reduce the size of what's written at checkpoint time, by
> lowering the amount of data involved.

Interesting!
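To make the consequence concrete, a simplified sketch of the write-phase
throttling as described above; the real logic lives in CheckpointWriteDelay()
and IsCheckpointOnSchedule() in bgwriter.c, and write_next_dirty_buffer() is
a hypothetical stand-in.

static void
write_phase_sketch(int ndirty)
{
    int         i;

    for (i = 0; i < ndirty; i++)
    {
        write_next_dirty_buffer(i);     /* hypothetical stand-in */

        /* If ahead of schedule, sleep exactly ONE nap interval; one nap per
         * dirty buffer caps the total spread at roughly
         * bgwriter_delay * ndirty, whatever the completion target says. */
        if (IsCheckpointOnSchedule((double) (i + 1) / ndirty))
            pg_usleep(BgWriterDelay * 1000L);
    }
}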

>
> What I do instead is nap until just after the planned schedule, then execute
> the sync.  What ends up happening then is that there can be a long pause
> between the end of the write phase and when syncs start to happen, which I
> consider a good thing.  Gives the kernel a little more time to try and get
> writes moving out to disk.

Sounds like a really good idea.
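Roughly, the pacing described there might look like the sketch below, reusing
the illustrative helpers from the earlier sketches; checkpoint_sync_on_schedule()
is a hypothetical analogue of IsCheckpointOnSchedule() that judges progress
against the sync target instead of the write target.

static void
sync_phase_paced(double write_target, int syncs_total)
{
    int         syncs_done = 0;

    while (syncs_done < syncs_total)
    {
        double      progress = sync_phase_progress(write_target,
                                                   syncs_done, syncs_total);

        /* Nap until the planned schedule catches up, absorbing fsync
         * requests along the way, then issue the next sync. */
        while (checkpoint_sync_on_schedule(progress))   /* hypothetical */
        {
            AbsorbFsyncRequests();
            pg_usleep(BgWriterDelay * 1000L);
        }
        seg_fsync(syncs_done);          /* hypothetical stand-in, as above */
        syncs_done++;
    }
}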

> Here's what that looks like on my development
> desktop:
>
> 2011-02-07 00:46:24 EST: LOG:  checkpoint starting: time
> 2011-02-07 00:48:04 EST: DEBUG:  checkpoint sync:  estimated segments=10
> 2011-02-07 00:48:24 EST: DEBUG:  checkpoint sync: naps=99
> 2011-02-07 00:48:36 EST: DEBUG:  checkpoint sync: number=1
> file=base/16736/16749.1 time=12033.898 msec
> 2011-02-07 00:48:36 EST: DEBUG:  checkpoint sync: number=2
> file=base/16736/16749 time=60.799 msec
> 2011-02-07 00:48:48 EST: DEBUG:  checkpoint sync: naps=59
> 2011-02-07 00:48:48 EST: DEBUG:  checkpoint sync: number=3
> file=base/16736/16756 time=0.003 msec
> 2011-02-07 00:49:00 EST: DEBUG:  checkpoint sync: naps=60
> 2011-02-07 00:49:00 EST: DEBUG:  checkpoint sync: number=4
> file=base/16736/16750 time=0.003 msec
> 2011-02-07 00:49:12 EST: DEBUG:  checkpoint sync: naps=60
> 2011-02-07 00:49:12 EST: DEBUG:  checkpoint sync: number=5
> file=base/16736/16737 time=0.004 msec
> 2011-02-07 00:49:24 EST: DEBUG:  checkpoint sync: naps=60
> 2011-02-07 00:49:24 EST: DEBUG:  checkpoint sync: number=6
> file=base/16736/16749_fsm time=0.004 msec
> 2011-02-07 00:49:36 EST: DEBUG:  checkpoint sync: naps=60
> 2011-02-07 00:49:36 EST: DEBUG:  checkpoint sync: number=7
> file=base/16736/16740 time=0.003 msec
> 2011-02-07 00:49:48 EST: DEBUG:  checkpoint sync: naps=60
> 2011-02-07 00:49:48 EST: DEBUG:  checkpoint sync: number=8
> file=base/16736/16749_vm time=0.003 msec
> 2011-02-07 00:50:00 EST: DEBUG:  checkpoint sync: naps=60
> 2011-02-07 00:50:00 EST: DEBUG:  checkpoint sync: number=9
> file=base/16736/16752 time=0.003 msec
> 2011-02-07 00:50:12 EST: DEBUG:  checkpoint sync: naps=60
> 2011-02-07 00:50:12 EST: DEBUG:  checkpoint sync: number=10
> file=base/16736/16754 time=0.003 msec
> 2011-02-07 00:50:12 EST: LOG:  checkpoint complete: wrote 14335 buffers
> (43.7%); 0 transaction log file(s) added, 0 removed, 64 recycled;
> write=47.873 s, sync=127.819 s, total=227.990 s; sync files=10,
> longest=12.033 s, average=1.209 s
>
> Since this is ext3, the spike during the first sync is brutal anyway, but it
> tried very hard to avoid that:  it waited 99 * 200ms = 19.8 seconds between
> writing the last buffer and when it started syncing them (00:48:04 to
> 00:48:24).  Given the slow write for #1, it was then behind, so it
> immediately moved on to #2.  But after that, it was able to insert a moderate
> nap time between successive syncs--60 naps is 12 seconds, and it keeps that
> pace for the remainder of the sync.  This is the same sort of thing I'd
> worked out as optimal on the system this patch originated from, except it
> had a lot more dirty relations; that's why its naptime was the 3 seconds
> hard-coded into earlier versions of this patch.
>
> Results on XFS with mini-server class hardware should be interesting...
>
> --
> Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
> PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
> "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books



--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

