Re: Spread checkpoint sync - Mailing list pgsql-hackers
From | Cédric Villemain
---|---
Subject | Re: Spread checkpoint sync
Date |
Msg-id | AANLkTimvJP5SnS6u826BUt2=NmwM+43AQQgnnKwY_yWg@mail.gmail.com
In response to | Re: Spread checkpoint sync (Greg Smith <greg@2ndquadrant.com>)
Responses | Re: Spread checkpoint sync
List | pgsql-hackers
2011/2/7 Greg Smith <greg@2ndquadrant.com>:
> Robert Haas wrote:
>>
>> With the fsync queue compaction patch applied, I think most of this is
>> now not needed.  Attached please find an attempt to isolate the
>> portion that looks like it might still be useful.  The basic idea of
>> what remains here is to make the background writer still do its normal
>> stuff even when it's checkpointing.  In particular, with this patch
>> applied, PG will:
>>
>> 1. Absorb fsync requests a lot more often during the sync phase.
>> 2. Still try to run the cleaning scan during the sync phase.
>> 3. Pause for 3 seconds after every fsync.
>>
>
> Yes, the bits you extracted were the remaining useful parts from the
> original patch.  Today was quiet here because there were sports on or
> something, and I added full auto-tuning magic to the attached update.  I
> need to kick off benchmarks and report back tomorrow to see how well this
> does, but any additional patch here would only be code cleanup on the
> messy stuff I did in here (plus proper implementation of the pair of
> GUCs).  This has finally gotten to the exact logic I've been meaning to
> complete as spread sync since the idea was first postponed in 8.3, with
> the benefit of some fsync absorption improvements along the way too.
>
> The automatic timing is modeled on the existing
> checkpoint_completion_target concept, except with a new tunable (not yet
> added as a GUC) currently called CheckPointSyncTarget, set to 0.8 right
> now.  What I think I want to do is make the existing
> checkpoint_completion_target now be the target for the end of the sync
> phase, matching its name; people who bumped it up won't necessarily even
> have to change anything.  Then the new GUC can be checkpoint_write_target,
> representing the target that is in there right now.

Is it worth starting a new thread about the different I/O improvements done
so far or in progress, and about how we might add new GUCs (if they are even
required!) with some intelligence shared between those patches?  (For
instance, the hint bit I/O limit probably needs a tunable defining something
similar to a hint_write_completion_target and/or an I/O throttling
strategy... items which are still in gestation...)

> I tossed the earlier idea of counting relations to sync based on the write
> phase data as too inaccurate after testing, and with it for now goes
> checkpoint sorting.  Instead, I just take a first pass over pendingOpsTable
> to get a total number of things to sync, which will always match the real
> count barring strange circumstances (like dropping a table).
>
> As for automatically determining the interval, I take the number of syncs
> that have finished so far, divide by the total, and get a number between
> 0.0 and 1.0 that represents progress on the sync phase.  I then use the
> same basic CheckpointWriteDelay logic that is there for spreading writes
> out, except with the new sync target.  I realized that if we assume the
> checkpoint writes should have finished in CheckPointCompletionTarget worth
> of time or segments, we can compute a new progress metric with the formula:
>
> progress = CheckPointCompletionTarget + (1.0 - CheckPointCompletionTarget) *
>            finished / goal;
>
> Where "finished" is the number of segments synced so far, while "goal" is
> the total.  To turn this into an example, let's say the default parameters
> are set, we've finished the writes, and finished 1 out of 4 syncs; that
> much work will be considered:
>
> progress = 0.5 + (1.0 - 0.5) * 1 / 4 = 0.625
>
> On a scale that effectively aims to have the sync work finished by 0.8.
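Just to check I read the progress formula correctly, here is a tiny
standalone sketch; the function and variable names are mine, not the
patch's:

```c
#include <stdio.h>

/*
 * Sketch of the sync-phase progress metric as I read it; names are
 * illustrative only, not necessarily the ones used in the patch.
 */
static double
sync_phase_progress(int finished, int goal, double write_target)
{
    /* write_target: share of the schedule reserved for the write phase
     * (checkpoint_completion_target today, 0.5 by default). */
    if (goal <= 0)
        return write_target;
    return write_target + (1.0 - write_target) * (double) finished / (double) goal;
}

int
main(void)
{
    /* The example from the mail: defaults, writes done, 1 of 4 syncs finished. */
    printf("progress = %.3f\n", sync_phase_progress(1, 4, 0.5));   /* 0.625 */
    return 0;
}
```

It prints 0.625 for the 1-of-4 example, on a scale where the sync work is
meant to be finished by 0.8.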
> I don't use quite the same logic as CheckpointWriteDelay though.  It turns
> out the existing checkpoint_completion implementation doesn't always work
> like I thought it did, which provides some very interesting insight into
> why my attempts to work around checkpoint problems haven't worked as well
> as expected over the last few years.  I thought that what it did was wait
> until an amount of time determined by the target was reached before it did
> the next write.  That's not quite it; what it actually does is check
> progress against the target, then sleep exactly one nap interval if it is
> ahead of schedule.  That is only the same thing if you have a lot of
> buffers to write relative to the amount of time involved.  There's some
> alternative logic if you don't have bgwriter_lru_maxpages set, but in the
> normal situation it effectively means that:
>
> maximum write spread time = bgwriter_delay * checkpoint dirty blocks
>
> No matter how far apart you try to spread the checkpoints.  Now, typically,
> when people run into these checkpoint spikes in production, reducing
> shared_buffers improves that.  But I now realize that doing so will then
> reduce the average number of dirty blocks participating in the checkpoint,
> and therefore potentially pull the spread down at the same time!  Also, if
> you try to tune bgwriter_delay down to get better background cleaning,
> you're also reducing the maximum spread.  Between this issue and the bad
> behavior when the fsync queue fills, no wonder this has been so hard to
> tune out of production systems.  At some point, the reduction in spread
> defeats further attempts to reduce the size of what's written at checkpoint
> time by lowering the amount of data involved.

Interesting!  (I sketch that ceiling arithmetic just below this quote.)

> What I do instead is nap until just after the planned schedule, then
> execute the sync.  What ends up happening then is that there can be a long
> pause between the end of the write phase and when syncs start to happen,
> which I consider a good thing.  Gives the kernel a little more time to try
> and get writes moving out to disk.

That sounds like a really good idea.
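About the spread ceiling quoted above: the arithmetic is worth making
concrete.  A trivial standalone sketch (the block counts are picked
arbitrarily, and the real write loop also tracks WAL-based progress, not
just time):

```c
#include <stdio.h>

/*
 * The arithmetic behind the ceiling quoted above: the existing write loop
 * sleeps at most one bgwriter_delay per dirty buffer (one schedule check,
 * so at most one nap, per buffer written), so the write phase can never be
 * spread over more than delay * dirty blocks, whatever checkpoint_timeout
 * and the completion target would otherwise allow.
 */
int
main(void)
{
    const double bgwriter_delay_s = 0.2;                 /* default 200 ms */
    const int    dirty_blocks[]   = {16384, 2048, 256};  /* example checkpoint sizes */

    for (int i = 0; i < 3; i++)
        printf("%6d dirty blocks x %.1f s -> spread ceiling %.0f s\n",
               dirty_blocks[i], bgwriter_delay_s,
               bgwriter_delay_s * dirty_blocks[i]);

    /* Shrinking shared_buffers (fewer dirty blocks) or lowering
     * bgwriter_delay pulls this ceiling down with it. */
    return 0;
}
```

With 2048 dirty blocks and the default 200 ms delay, the writes cannot be
spread over more than about 7 minutes, and shrinking either number shrinks
that ceiling further.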
> Here's what that looks like on my development desktop:
>
> 2011-02-07 00:46:24 EST: LOG: checkpoint starting: time
> 2011-02-07 00:48:04 EST: DEBUG: checkpoint sync: estimated segments=10
> 2011-02-07 00:48:24 EST: DEBUG: checkpoint sync: naps=99
> 2011-02-07 00:48:36 EST: DEBUG: checkpoint sync: number=1 file=base/16736/16749.1 time=12033.898 msec
> 2011-02-07 00:48:36 EST: DEBUG: checkpoint sync: number=2 file=base/16736/16749 time=60.799 msec
> 2011-02-07 00:48:48 EST: DEBUG: checkpoint sync: naps=59
> 2011-02-07 00:48:48 EST: DEBUG: checkpoint sync: number=3 file=base/16736/16756 time=0.003 msec
> 2011-02-07 00:49:00 EST: DEBUG: checkpoint sync: naps=60
> 2011-02-07 00:49:00 EST: DEBUG: checkpoint sync: number=4 file=base/16736/16750 time=0.003 msec
> 2011-02-07 00:49:12 EST: DEBUG: checkpoint sync: naps=60
> 2011-02-07 00:49:12 EST: DEBUG: checkpoint sync: number=5 file=base/16736/16737 time=0.004 msec
> 2011-02-07 00:49:24 EST: DEBUG: checkpoint sync: naps=60
> 2011-02-07 00:49:24 EST: DEBUG: checkpoint sync: number=6 file=base/16736/16749_fsm time=0.004 msec
> 2011-02-07 00:49:36 EST: DEBUG: checkpoint sync: naps=60
> 2011-02-07 00:49:36 EST: DEBUG: checkpoint sync: number=7 file=base/16736/16740 time=0.003 msec
> 2011-02-07 00:49:48 EST: DEBUG: checkpoint sync: naps=60
> 2011-02-07 00:49:48 EST: DEBUG: checkpoint sync: number=8 file=base/16736/16749_vm time=0.003 msec
> 2011-02-07 00:50:00 EST: DEBUG: checkpoint sync: naps=60
> 2011-02-07 00:50:00 EST: DEBUG: checkpoint sync: number=9 file=base/16736/16752 time=0.003 msec
> 2011-02-07 00:50:12 EST: DEBUG: checkpoint sync: naps=60
> 2011-02-07 00:50:12 EST: DEBUG: checkpoint sync: number=10 file=base/16736/16754 time=0.003 msec
> 2011-02-07 00:50:12 EST: LOG: checkpoint complete: wrote 14335 buffers (43.7%); 0 transaction log file(s) added, 0 removed, 64 recycled; write=47.873 s, sync=127.819 s, total=227.990 s; sync files=10, longest=12.033 s, average=1.209 s
>
> Since this is ext3, the spike during the first sync is brutal anyway, but
> it tried very hard to avoid that: it waited 99 * 200 ms = 19.8 seconds
> between writing the last buffer and when it started syncing them (00:48:04
> to 00:48:24).  Given the slow write for #1, it was then behind, so it
> immediately moved on to #2.  But after that, it was able to insert a
> moderate nap time between successive syncs--60 naps is 12 seconds, and it
> keeps that pace for the remainder of the sync.  This is the same sort of
> thing I'd worked out as optimal on the system this patch originated from,
> except it had a lot more dirty relations; that's why its nap time was the
> 3 seconds hard-coded into earlier versions of this patch.
>
> Results on XFS with mini-server class hardware should be interesting...
>
> --
> Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
> PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
> "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
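PS: to convince myself I follow the new pacing, I tried a small standalone
simulation of the sync schedule.  Every constant in it is a guess on my side
(checkpoint_timeout = 300 s, the write phase ending about 100 s in as in the
log, one slow ext3 sync of ~12 s), and it compares against elapsed time
only, while the real code also tracks WAL segments:

```c
#include <stdio.h>

/*
 * Standalone simulation of the sync pacing as I understand it from the
 * description and the log: before each sync, nap in bgwriter_delay steps
 * until the clock catches up with the progress already banked, then issue
 * the fsync.  All constants are my guesses, not values from the patch.
 */
int
main(void)
{
    const double delay_s      = 0.2;    /* bgwriter_delay = 200 ms              */
    const double write_target = 0.5;    /* progress credited to the write phase */
    const double sync_target  = 0.8;    /* everything should be done by here    */
    const double timeout_s    = 300.0;  /* checkpoint_timeout, assumed default  */
    const int    goal         = 10;     /* segments to sync, as in the log      */

    double now_s = 100.0;               /* write phase ended ~100 s in          */
    /* Rough per-sync costs echoing the log: one slow ext3 sync, then fast. */
    const double sync_cost_s[10] = {12.0, 0.06, 0, 0, 0, 0, 0, 0, 0, 0};

    for (int finished = 0; finished < goal; finished++)
    {
        double progress = write_target +
                          (1.0 - write_target) * (double) finished / goal;
        int naps = 0;

        /* Ahead of schedule while banked progress exceeds the elapsed
         * fraction of the interval, rescaled so that progress 1.0 is due
         * at sync_target of checkpoint_timeout. */
        while (progress > now_s / (timeout_s * sync_target))
        {
            now_s += delay_s;           /* nap one bgwriter_delay */
            naps++;
        }
        printf("sync %2d: naps=%d\n", finished + 1, naps);
        now_s += sync_cost_s[finished];
    }
    return 0;
}
```

It prints roughly the naps=99 / no nap / naps=60 pattern from the log, and
puts the last sync about 228 s into the checkpoint, close to the
total=227.990 s reported.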