Re: checkpointer continuous flushing - Mailing list pgsql-hackers
From: Fabien COELHO
Subject: Re: checkpointer continuous flushing
Msg-id: alpine.DEB.2.10.1506220713150.16123@sto
In response to: Re: checkpointer continuous flushing (Andres Freund <andres@anarazel.de>)
Responses: Re: checkpointer continuous flushing
List: pgsql-hackers
Hello Andres,

>> So this is an evidence-based decision.
>
> Meh. You're testing on low concurrency.

Well, I'm just testing on the available box. I do not see the link between high concurrency and whether moving fsync as early as possible would have a large performance impact. I think it might be interesting if the bgwriter is doing a lot of writes, but I'm not sure under which configuration and load that would be the case.

>>> I think it's a really bad idea to do this in chunks.
>>
>> The small problem I see is that for a very large setting there could be
>> several seconds or even minutes of sorting, which may or may not be
>> desirable, so having some control on that seems a good idea.
>
> If the sorting of the dirty blocks alone takes minutes, it'll never
> finish writing that many buffers out. That's an utterly bogus argument.

Well, if in the future you have 8 TB of memory (I saw a 512 GB server a few weeks ago) and set shared_buffers=2TB, then, if I'm not mistaken, in the worst case you may have 256 million 8 kB buffers to checkpoint. How long that takes really depends on the I/O subsystem attached to the box, but if you bought 8 TB of RAM you probably have fast I/O hardware to go with it.

>> Another argument is that Tom said he wanted that:-)
>
> I don't think he said that when we discussed this last.

That is what I was recalling when I wrote this sentence:

http://www.postgresql.org/message-id/6599.1409421040@sss.pgh.pa.us

But it had more to do with memory-allocation management.

>> In practice the value can be set at a high value so that it is nearly always
>> sorted in one go. Maybe value "0" could be made special and used to trigger
>> this behavior systematically, and be the default.
>
> You're just making things too complicated.

ISTM that it is not really complicated, but anyway it is easy to change the checkpoint_sort setting to a boolean. In the reported performance tests there is usually just one chunk anyway, sometimes two, so they give an idea of the overall performance effect.

>> This is not an issue if the chunks are large enough, and anyway the guc
>> allows to change the behavior as desired.
>
> I don't think this is true. If two consecutive blocks are dirty, but you
> sync them in two different chunks, you *always* will cause additional
> random IO.

I think that the amount of extra random I/O would be small if the chunks are large, i.e. the performance benefit of sorting larger and larger chunks is decreasing.

> Either the drive will have to skip the write for that block,
> or the os will prefetch the data. More importantly with SSDs it voids
> the wear leveling advantages.

Possibly. I do not understand the wear leveling done by SSD firmware.

>>> often interleaved. That pattern is horrible for SSDs too. We should always
>>> try to do this at once, and only fail back to using less memory if we
>>> couldn't allocate everything.
>>
>> The memory is needed anyway in order to avoid a double or significantly more
>> heavy implementation for the throttling loop. It is allocated once on the
>> first checkpoint. The allocation could be moved to the checkpointer
>> initialization if this is a concern. The memory needed is one int per
>> buffer, which is smaller than the 2007 patch.
>
> There's a reason the 2007 patch (and my revision of it last year) did
> what it did. You can't just access buffer descriptors without
> locking.

I really think that you can, because the sorting is only "advisory": the checkpointer will work fine if the sorting is wrong or not done at all, which is the current situation when the checkpointer writes buffers.
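To make the "advisory" point concrete, here is a minimal sketch (names such as SortKey and sortkey_cmp are hypothetical; this is not the submitted patch): the sort key is copied from each buffer without holding the buffer header lock, and a stale key can at worst degrade the write order, because the write path later re-checks each buffer under its lock.

/*
 * Sketch of "advisory" sorting of the buffers selected for a checkpoint.
 * The keys are read without taking the buffer header lock: if a key is
 * stale, the only consequence is a slightly worse write order, since the
 * write path re-checks the buffer under its lock and skips it if it no
 * longer needs writing.
 */
#include <stdlib.h>
#include <stdint.h>

typedef struct SortKey
{
    uint32_t  relnode;   /* relation file node (read unlocked) */
    uint32_t  forknum;   /* fork number        (read unlocked) */
    uint32_t  blocknum;  /* block within fork  (read unlocked) */
    int       buf_id;    /* index into the buffer pool         */
} SortKey;

static int
sortkey_cmp(const void *a, const void *b)
{
    const SortKey *ka = a, *kb = b;

    if (ka->relnode != kb->relnode)
        return ka->relnode < kb->relnode ? -1 : 1;
    if (ka->forknum != kb->forknum)
        return ka->forknum < kb->forknum ? -1 : 1;
    if (ka->blocknum != kb->blocknum)
        return ka->blocknum < kb->blocknum ? -1 : 1;
    return 0;
}

/* Order is advisory only; correctness is enforced later, at write time. */
static void
sort_checkpoint_buffers(SortKey *keys, size_t nkeys)
{
    qsort(keys, nkeys, sizeof(SortKey), sortkey_cmp);
}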
The only condition is that buffers must not be moved while they carry their "to write in this checkpoint" flag, but that is also necessary for the current checkpointer logic to work. Moreover, this trick is already present in the patch I submitted: some tests are done without locking, but the actual buffer write takes the lock and skips the write if the earlier unlocked test turns out to have been wrong, as described in comments in the code.

> Besides, causing additional cacheline bouncing during the
> sorting process is a bad idea.

Hmmm. Avoiding that would multiply the memory required by 3 or 4 (buf_id, relation, forknum, offset instead of just buf_id), and I understood that memory was a concern. Moreover, once the sort process has pulled the cache lines holding the sorting data from the buffer descriptors into its cache, I think it should be pretty much okay; incidentally, they would probably have been brought into cache already by the scan that collects the buffers. Also, I do not think that the sorting time for 128,000 buffers, and the possible cache misses, is a big issue, but I do not have a measurement to back that up. I could try to collect some data about it.

--
Fabien.
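For scale, here is a rough illustration of the memory trade-off discussed above (field names and sizes are assumptions, not the patch): sorting only buf_id keeps the array small but makes each comparison read from a shared buffer descriptor, while carrying the full key in a dense per-checkpoint array costs roughly 3-4 times the memory.

/*
 * Rough sizing of the two layouts under discussion (illustrative only).
 *
 *   shared_buffers = 2 TB with 8 kB pages  ->  2 TB / 8 kB = 2^28,
 *   i.e. roughly 256 million buffers in the worst case.
 *
 *   (a) one int (buf_id) per buffer:   2^28 * 4 bytes  ~= 1 GB
 *       but each comparison must fetch the key from the shared
 *       buffer descriptor, touching one of its cache lines.
 *
 *   (b) a dense item carrying the key: 2^28 * 16 bytes ~= 4 GB
 *       roughly 3-4x the memory, but comparisons stay inside the
 *       checkpointer's private array.
 */
#include <stdint.h>

typedef struct DenseItem
{
    uint32_t  relnode;
    uint32_t  forknum;
    uint32_t  blocknum;
    int32_t   buf_id;
} DenseItem;                 /* 16 bytes: layout (b) */

typedef int32_t IdOnlyItem;  /*  4 bytes: layout (a), key fetched on compare */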