Re: Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule? - Mailing list pgsql-hackers
From | Fabien COELHO
---|---
Subject | Re: Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?
Date |
Msg-id | alpine.DEB.2.10.1512231554490.22350@sto
In response to | Re: Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule? (Robert Haas <robertmhaas@gmail.com>)
Responses | Re: Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?
List | pgsql-hackers
Hello Robert,

> On a pgbench test, and probably many other workloads, the impact of
> FPWs declines exponentially (or maybe geometrically, but I think
> exponentially) as we get further into the checkpoint.

Indeed. If the probability of hitting a page is uniform, I think that the
FPW probability is exp(-n/N) for the n-th page access.

> The first write is dead certain to need an FPW; after that, if access is
> more or less random, the chance of needing an FPW for the next write
> decreases in proportion to the number of FPWs already written. As the
> chances of NOT needing an FPW grow higher, the tps rate starts to
> increase, initially just a bit, but then faster and faster as the
> percentage of the working set that has already had an FPW grows. If the
> working set is large, we're still doing FPWs pretty frequently when the
> next checkpoint hits - if it's small, then it'll tail off sooner.

Yes.

>> My actual point is that it should be tested with different and especially
>> smaller values, because 1.5 changes the overall load distribution *a lot*.
>> For testing purposes I suggested that a GUC would help, but the patch
>> author has never come back to intervene on the thread, discuss the
>> arguments, or provide another patch.
>
> Well, somebody else should be able to hack a GUC into the patch.

Yep. But I'm so far behind everything that I was basically waiting for the
author to do it :-)

> I think one thing that this conversation exposes is that the size of
> the working set matters a lot. For example, if the workload is
> pgbench, you're going to see a relatively short FPW-related spike at
> scale factor 100, but at scale factor 3000 it's going to be longer and
> at some larger scale factor it will be longer still. Therefore you're
> probably right that 1.5 is unlikely to be optimal for everyone.
>
> Another point (which Jan Wieck made me think of) is that the optimal
> behavior here likely depends on whether xlog and data are on the same
> disk controller. If they aren't, the FPW spike and background writes
> may not interact as much.

Yep, I pointed that out as well. In that case the patch just disrupts the
checkpoint load for no benefit... which would make a GUC mandatory.

>> [...]. I think that it makes sense for xlog-triggered checkpoints, but
>> less so with time-triggered checkpoints. I may be wrong, but I think
>> that this deserves careful analysis.
>
> Hmm, off-hand I don't see why that should make any difference. No
> matter what triggers the checkpoint, there is going to be a spike of
> FPI activity at the beginning.

Hmmm. Let me try to explain.

AFAICR, with xlog-triggered checkpoints the checkpointer's progress is
measured against the amount of WAL written, which does not grow linearly
in time for the reason you pointed out above (a lot of FPWs at the
beginning, fewer at the end). As the WAL grows quickly, the checkpointer
thinks it is late and has some catching up to do, so it starts trying to
write quickly as well. There is a double whammy: both are trying to write
more at the same time, and probably neither is succeeding.

For time-triggered checkpoints, the WAL still fills up quickly, *but* the
checkpointer load is balanced against time. This is a "simple" whammy: the
checkpointer uses I/O bandwidth that the WAL needs, and it could afford to
wait a little because the WAL will need less bandwidth later, but it is not
trying to catch up by writing even more. So the load shifting needed in
this case is not the same as in the previous one.
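To make the distinction concrete, here is a toy model of the "on schedule"
test, in the spirit of IsCheckpointOnSchedule() but not the actual code:
the function name, the three fractions and the exponent parameter are all
illustrative assumptions, with the exponent standing in for the 1.5
correction discussed in this thread.

/*
 * Toy model of the checkpointer's "on schedule" test -- NOT the actual
 * PostgreSQL code; names, fractions and the exponent parameter are
 * illustrative only.  exponent = 1.0 is the plain linear comparison,
 * 1.5 stands for the correction discussed in this thread.
 */
#include <math.h>
#include <stdbool.h>

static bool
checkpoint_on_schedule(double written_frac, /* buffers written / buffers to write */
                       double time_frac,    /* elapsed time / checkpoint_timeout */
                       double wal_frac,     /* WAL consumed / WAL budget */
                       double exponent)
{
    /*
     * Full-page writes inflate the WAL early in the cycle, so wal_frac
     * rises faster than linearly at first.  Raising it to a power > 1
     * discounts that early burst so the checkpointer does not rush, but
     * it does nothing for the time-based test below, which is why the
     * two trigger types arguably need different treatment.
     */
    if (written_frac < pow(wal_frac, exponent))
        return false;           /* behind on WAL consumption: hurry up */

    if (written_frac < time_frac)
        return false;           /* behind on elapsed time: hurry up */

    return true;                /* on schedule: keep throttling writes */
}

With an exponent of 1.5, an early burst that consumes 30% of the WAL budget
only requires about 16% of the buffers to have been written (0.3^1.5 ~= 0.16)
to still count as on schedule, instead of 30% with the linear test; the
time-based comparison is unaffected.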
As you point out there is a WAL spike in both cases, but in one case there
is also a checkpointer spike, while in the other the checkpointer load is
flat. So I think that the correction should not be the same in both cases.
Moreover, no correction is needed at all if WAL & relations are on
different disks.

Also, as you pointed out, it depends on the load: for a large database the
FPWs are spread more evenly, for smaller ones there is a spike. So the
corrective formula should take that information into account, which means
that some evaluation of the FPW distribution would have to be collected...
All this is non-trivial. I may do some math to try to solve this, but I'm
pretty sure that a blanket 1.5 correction in all cases is not the solution.

--
Fabien.
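As a first cut, the model implied by the uniform-access assumption earlier
in this mail would look like the sketch below; it is only a sketch, and all
names and parameters are illustrative.

/*
 * Sketch of the model implied by the exp(-n/N) assumption above, not a
 * patch: with N pages in the working set, full-page images of size F and
 * plain records of size r, the expected WAL volume after n uniformly
 * random page writes is roughly
 *
 *     W(n) ~= F * N * (1 - exp(-n/N)) + r * n
 *
 * so the WAL fraction consumed at "true" progress x = n / n_total is:
 */
#include <math.h>

static double
expected_wal_fraction(double x,         /* true checkpoint progress, 0..1 */
                      double n_total,   /* page writes per checkpoint cycle */
                      double N,         /* working-set size in pages */
                      double F,         /* full-page image size in bytes */
                      double r)         /* plain WAL record size in bytes */
{
    double n = x * n_total;
    double wal = F * N * (1.0 - exp(-n / N)) + r * n;
    double wal_total = F * N * (1.0 - exp(-n_total / N)) + r * n_total;

    return wal / wal_total;
}

Inverting this mapping would give the correction to apply to the WAL-based
progress comparison, and since it depends on the working-set size and on
the ratio of full-page-image size to plain record size, no single fixed
exponent such as 1.5 can match it in all cases.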