Re: checkpointer continuous flushing - Mailing list pgsql-hackers
From | Fabien COELHO |
---|---|
Subject | Re: checkpointer continuous flushing |
Date | |
Msg-id | alpine.DEB.2.10.1601071613040.5278@sto |
In response to | Re: checkpointer continuous flushing (Andres Freund <andres@anarazel.de>) |
Responses | Re: checkpointer continuous flushing |
List | pgsql-hackers |
Hello Andres,

>> One of the points of aggregating flushes is that the range flush call
>> cost is significant, as shown by preliminary tests I did, probably up in
>> the thread, so it makes sense to limit this cost, hence the aggregation.
>> These removed some performance regressions I had in some cases.
>
> FWIW, my tests show that flushing for clean ranges is pretty cheap.

Yes, I agree that it is quite cheap, but I had a few % tps regressions in
some cases without aggregating, and aggregating was enough to avoid these
small regressions.

>> Also, the granularity of the buffer flush call is a file + offset + size,
>> so necessarily it should be done this way (i.e. per file).
>
> What syscalls we issue, and at what level we track outstanding flushes,
> doesn't have to be the same.

Sure. But the current version is simple, efficient and proven by many runs,
so there should be a very strong argument, showing a significant benefit, to
justify changing the approach, and I see no such thing in your arguments.
For me the current approach is optimal for the checkpointer, because it
takes advantage of all available information to perform a better job.

>> Once buffers are sorted per file and offset within file, then written
>> buffers are as close as possible one after the other, the merging is very
>> easy to compute (it is done on the fly, no need to keep the list of
>> buffers for instance), it is optimally effective, and when the
>> checkpointed file changes then we will never go back to it before the
>> next checkpoint, so there is no reason not to flush right then.
>
> Well, that's true if there's only one tablespace, but e.g. not the case
> with two tablespaces of about the same number of dirty buffers.

ISTM that in the version of the patch I sent there is one flushing
structure per tablespace, each doing its own flushing on its files, so it
should work the same, only the writing intensity is divided by the number
of tablespaces? Or am I missing something?

>> So basically I do not see a clear positive advantage to your suggestion,
>> especially when taking into consideration the scheduling process of the
>> scheduler:
>
> I don't think it makes a big difference for the checkpointer alone, but
> it makes the interface much more suitable for other processes, e.g. the
> bgwriter, and normal backends.

Hmmm. ISTM that the requirements are not exactly the same for the bgwriter
and backends vs the checkpointer. The checkpointer has the advantage of
being able to plan its IOs over the long term (volume & time are known...),
and the implementation takes full benefit of this planning by sorting,
scheduling and flushing buffers so as to generate as many sequential writes
as possible. The bgwriter and backends have a much shorter vision (a few
seconds, or just the one query being processed), so the solution will be
less efficient and probably messier on the coding side. This is life. I do
not see why the checkpointer should not take the benefit of full planning
just because other processes cannot do the same, especially as under many
loads the checkpointer does most of the writing and is thus the limiting
factor. So I do not buy your suggestion for the checkpointer. Maybe it will
be the way to go for the bgwriter and backends; then fine for them.
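For illustration only, here is a minimal sketch of the kind of on-the-fly
flush aggregation discussed above: contiguous writes to the same file are
merged into one pending range, and the range is flushed with
sync_file_range() as soon as it breaks. The names (PendingFlush,
flush_after_write, flush_pending) and the direct syscall use are made up
for this sketch, they are not the patch code.

```c
/*
 * Sketch of on-the-fly flush-range merging, assuming buffer writes are
 * issued in sorted (file, block) order.  Illustration only.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>

#define BLCKSZ 8192				/* standalone stand-in for the usual constant */

typedef struct PendingFlush
{
	int		fd;					/* file of the pending range, -1 if none */
	off_t	offset;				/* start of the pending range */
	off_t	nbytes;				/* length of the pending range */
} PendingFlush;

/* issue the accumulated range, if any, and reset */
static void
flush_pending(PendingFlush *p)
{
	if (p->fd >= 0 && p->nbytes > 0)
		(void) sync_file_range(p->fd, p->offset, p->nbytes,
							   SYNC_FILE_RANGE_WRITE);
	p->fd = -1;
	p->nbytes = 0;
}

/*
 * Called after each buffer write; initialize with { -1, 0, 0 } and call
 * flush_pending() once more at the end of the checkpoint.
 */
static void
flush_after_write(PendingFlush *p, int fd, off_t block_offset)
{
	if (p->fd == fd && block_offset == p->offset + p->nbytes)
	{
		/* contiguous with the pending range: just extend it */
		p->nbytes += BLCKSZ;
		return;
	}
	/* range breaks (other file, or a hole): flush and start a new range */
	flush_pending(p);
	p->fd = fd;
	p->offset = block_offset;
	p->nbytes = BLCKSZ;
}
```

Run over buffers sorted by file and offset, this produces one flush call per
contiguous run of writes, which is the aggregation effect mentioned above,
without keeping any list of written buffers.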
>>> Imo that means that we'd better track writes on a relfilenode + block
>>> number level.
>>
>> I do not think that it is a better option. Moreover, the current approach
>> has been proven to be very effective on hundreds of runs, so redoing it
>> differently for the sake of it does not look like good resource
>> allocation.
>
> For a subset of workloads, yes.

Hmmm. What I understood is that the workloads showing some performance
regressions (regressions that I have *not* seen in the many tests I ran)
are not due to checkpointer IOs, but rather occur in settings where most of
the writes are done by backends or the bgwriter. I do not see the point of
rewriting the checkpointer for them, although obviously I agree that
something has to be done for the other processes as well. Maybe if all the
writes (bgwriter and checkpointer) were performed by the same process then
some dynamic mixing and sorting and aggregating would make sense, but this
is currently not the case, and it would probably have quite a limited
effect. Basically I do not understand how changing the flushing
organisation as you suggest would improve checkpointer performance
significantly; as far as the checkpointer is concerned, it should only
degrade performance compared to the current version.

--
Fabien.