Re: checkpointer continuous flushing - Mailing list pgsql-hackers
From: Fabien COELHO
Subject: Re: checkpointer continuous flushing
Msg-id: alpine.DEB.2.10.1506200817400.31742@sto
In response to: Re: checkpointer continuous flushing (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
Hello Andres,

>>> - Move fsync as early as possible, suggested by Andres Freund?
>>>
>>> My opinion is that this should be left out for the nonce.
>
> "for the nonce" - what does that mean?

Nonce \Nonce\ (n[o^]ns), n. [For the nonce, OE. for the nones, ...
{for the nonce}, i.e. for the present time.

> I'm doubtful that it's a good idea to separate this out, if you did.

Actually I did: as explained in another mail, the fsync time reported in
the logs when the other options are activated is essentially null, so
moving the fsyncs earlier would not bring significant improvements on
these runs, and the patch changes enough things as it is. So this is an
evidence-based decision.

I also agree that the idea seems interesting in principle and should be
beneficial in some cases, but I would rather keep it on a TODO list,
together with trying to do better things in the bgwriter, and focus on
the current proposal, which already changes the checkpointer throttling
logic significantly.

>> - as version 2: checkpoint buffer sorting based on a 2007 patch by
>> Takahiro Itagaki, but with a smaller and static buffer allocated once.
>> Also, sorting is done by chunks of 131072 pages in the current version,
>> with a guc to change this value.
>
> I think it's a really bad idea to do this in chunks.

The small problem I see is that with a very large setting there could be
several seconds or even minutes of sorting, which may or may not be
desirable, so having some control over that seems a good idea. Another
argument is that Tom said he wanted it :-)

In practice the value can be set high enough that sorting is nearly
always done in one go. Maybe the value "0" could be made special, used
to trigger this behavior systematically, and be the default.

> That'll mean we'll frequently uselessly cause repetitive random IO,

This is not an issue if the chunks are large enough, and anyway the GUC
allows changing the behavior as desired. As I said, keeping some control
seems a good idea, and "full sorting" can be made the default behavior.

> often interleaved. That pattern is horrible for SSDs too. We should
> always try to do this at once, and only fall back to using less memory
> if we couldn't allocate everything.

The memory is needed anyway to avoid a double, or significantly heavier,
implementation of the throttling loop. It is allocated once, on the
first checkpoint; the allocation could be moved to checkpointer
initialization if that is a concern. The memory needed is one int per
buffer, which is smaller than in the 2007 patch.

>> . tiny: scale=10 shared_buffers=1GB checkpoint_timeout=30s time=6400s
>
> It'd be interesting to see numbers for tiny, without the overly small
> checkpoint timeout value. 30s is below the OS's writeback time.

The point of "tiny" was to trigger a lot of checkpoints; the size is
pretty ridiculous anyway, as the name implies. I think I did some tests
with other versions of the patch and a longer checkpoint_timeout on a
pretty small database, and they showed a smaller benefit from the
options, as one would expect. I'll try to re-run some.

> So you've not run things at more serious concurrency, that'd be
> interesting to see.

I do not have a box available for "serious concurrency".

> I'd also like to see concurrent workloads with synchronous_commit=off -
> I've seen absolutely horrible latency behaviour for that, and I'm
> hoping this will help. It's also a good way to simulate faster hardware
> than you have.
>
> It's also curious that sorting is detrimental for full speed 'tiny'.

Yep.
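To make the chunked-sorting discussion above concrete, here is a minimal
sketch of the idea, for illustration only; the identifiers
(checkpoint_sort_size, CkptSortItem, sort_checkpoint_buffers) are made
up and do not correspond to the actual patch code:

#include <stdlib.h>

/* Per-buffer sort entry; the sort key is (relnode, blocknum). */
typedef struct CkptSortItem
{
    int      buf_id;    /* index into shared buffers (payload only) */
    unsigned relnode;   /* relation file node */
    unsigned blocknum;  /* block number within the relation */
} CkptSortItem;

/* GUC (hypothetical): buffers sorted per chunk; 0 = sort all at once */
static int checkpoint_sort_size = 131072;

/* Order writes by relation, then block, so the resulting I/O is sequential. */
static int
ckpt_item_cmp(const void *a, const void *b)
{
    const CkptSortItem *ia = a;
    const CkptSortItem *ib = b;

    if (ia->relnode != ib->relnode)
        return (ia->relnode < ib->relnode) ? -1 : 1;
    if (ia->blocknum != ib->blocknum)
        return (ia->blocknum < ib->blocknum) ? -1 : 1;
    return 0;
}

/* Sort the checkpoint's dirty-buffer list, one chunk at a time. */
static void
sort_checkpoint_buffers(CkptSortItem *items, int nitems)
{
    int chunk = (checkpoint_sort_size == 0) ? nitems : checkpoint_sort_size;
    int start;

    for (start = 0; start < nitems; start += chunk)
    {
        int n = (start + chunk <= nitems) ? chunk : nitems - start;

        qsort(items + start, n, sizeof(CkptSortItem), ckpt_item_cmp);
    }
}

Note that the sketch stores a small struct per buffer to keep the example
self-contained; the patch itself keeps only one int per buffer and looks
the tag up in the buffer descriptors when comparing, which is what keeps
the memory footprint down.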
>> With SSD both options would probably have limited benefit.
>
> I doubt that. Small random writes have bad consequences for wear
> leveling. You might not notice that with a short test - again, I doubt
> it - but it'll definitely become visible over time.

Possibly. Testing such effects does not seem easy, though. At least I
have not seen "write stalls" on SSD, which is my primary concern.

--
Fabien.