Re: checkpointer continuous flushing - Mailing list pgsql-hackers

From Fabien COELHO
Subject Re: checkpointer continuous flushing
Date
Msg-id alpine.DEB.2.10.1506220713150.16123@sto
In response to Re: checkpointer continuous flushing  (Andres Freund <andres@anarazel.de>)
Responses Re: checkpointer continuous flushing  (Fabien COELHO <coelho@cri.ensmp.fr>)
List pgsql-hackers
Hello Andres,

>> So this is an evidence-based decision.
>
> Meh. You're testing on low concurrency.

Well, I'm just testing on the available box.

I do not see the link between high concurrency and whether moving fsync as 
early as possible would have a large performance impact. I think it might 
be interesting if bgwriter is doing a lot of writes, but I'm not sure 
under which configuration & load that would happen.

>>> I think it's a really bad idea to do this in chunks.
>>
>> The small problem I see is that for a very large setting there could be
>> several seconds or even minutes of sorting, which may or may not be
>> desirable, so having some control on that seems a good idea.
>
> If the sorting of the dirty blocks alone takes minutes, it'll never
> finish writing that many buffers out. That's an utterly bogus argument.

Well, if in the future you have 8 TB of memory (I saw a 512 GB memory 
server a few weeks ago) and set shared_buffers=2TB, then if I'm not 
mistaken you may have, in the worst case, 256 million 8 kB buffers to 
checkpoint. Then it really depends on the I/O subsystem used by the box, 
but if you bought 8 TB of RAM you would probably have nice I/O hardware 
attached.
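
For concreteness, here is a back-of-the-envelope computation as a 
standalone C snippet; the 2 TB / 8 kB figures are the hypothetical worst 
case above, not measurements:

#include <stdio.h>

int main(void)
{
    /* hypothetical worst case: shared_buffers = 2 TB, 8 kB blocks */
    const long long shared_buffers = 2LL << 40;   /* 2 TiB in bytes */
    const long long blcksz = 8 << 10;             /* 8 kB */
    const long long nbuffers = shared_buffers / blcksz;

    printf("buffers to sort: %lld (256 * 2^20, the \"256 million\" above)\n",
           nbuffers);
    /* with one 4-byte buf_id per buffer, the sort array is about 1 GiB */
    printf("sort array: %lld MiB\n", (nbuffers * 4) >> 20);
    return 0;
}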

>> Another argument is that Tom said he wanted that:-)
>
> I don't think he said that when we discussed this last.

That is what I was recalling when I wrote this sentence:

http://www.postgresql.org/message-id/6599.1409421040@sss.pgh.pa.us

But it had more to do with memory-allocation management.

>> In practice the value can be set at a high value so that it is nearly always
>> sorted in one go. Maybe value "0" could be made special and used to trigger
>> this behavior systematically, and be the default.
>
> You're just making things too complicated.

ISTM that it is not really complicated, but anyway it is easy to change 
the checkpoint_sort stuff to a boolean.

In the reported performance tests there is usually just one chunk anyway, 
sometimes two, so they give an idea of the overall performance effect.
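
For reference, here is a rough standalone sketch of the chunking idea as I 
understand it; the names (chunk_size, write_buffer, ...) are made up for 
illustration and are not the patch's actual identifiers:

#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *) a, y = *(const int *) b;
    return (x > y) - (x < y);
}

static void write_buffer(int blkno)
{
    printf("write block %d\n", blkno);   /* stand-in for the actual write */
}

int main(void)
{
    int to_write[] = {7, 3, 11, 1, 9, 5, 2, 10}; /* dirty blocks, unsorted */
    int n = (int) (sizeof(to_write) / sizeof(to_write[0]));
    int chunk_size = 4;            /* the knob; "large enough" => one chunk */

    for (int start = 0; start < n; start += chunk_size)
    {
        int len = (n - start < chunk_size) ? n - start : chunk_size;

        /* Sort only this chunk.  Blocks 10 and 11 are consecutive on disk
         * but end up in different chunks, which is the extra random I/O
         * Andres mentions; larger chunks make such boundaries rarer. */
        qsort(to_write + start, len, sizeof(int), cmp_int);

        for (int i = start; i < start + len; i++)
            write_buffer(to_write[i]);
    }
    return 0;
}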

>> This is not an issue if the chunks are large enough, and anyway the GUC
>> allows changing the behavior as desired.
>
> I don't think this is true. If two consecutive blocks are dirty, but you
> sync them in two different chunks, you *always* will cause additional
> random IO.

I think the amount of such additional random I/O would be small if the 
chunks are large: there is at most one such boundary per chunk, and the 
performance benefit of sorting larger and larger chunks decreases anyway.

> Either the drive will have to skip the write for that block,
> or the os will prefetch the data. More importantly with SSDs it voids
> the wear leveling advantages.

Possibly. I do not understand the wear leveling done by SSD firmware.

>>> often interleaved. That pattern is horrible for SSDs too. We should always
>>> try to do this at once, and only fall back to using less memory if we
>>> couldn't allocate everything.
>>
>> The memory is needed anyway in order to avoid a duplicated or significantly
>> heavier implementation of the throttling loop. It is allocated once on the
>> first checkpoint. The allocation could be moved to the checkpointer
>> initialization if this is a concern. The memory needed is one int per
>> buffer, which is less than what the 2007 patch needed.
>
> There's a reason the 2007 patch (and my revision of it last year) did
> what it did. You can't just access buffer descriptors without
> locking.

I really think that you can, because the sorting is only "advisory": the 
checkpointer will work fine even if the sorting is wrong or not done at 
all, just as it does now when it writes buffers unsorted. The only 
condition is that a buffer must not be moved while its "to write in this 
checkpoint" flag is set, but that is also required for the current 
checkpointer code to work.

Moreover, this trick already exists in the patch I submitted: some tests 
are done without locking, but the actual "buffer write" takes the lock and 
skips the buffer if the earlier unlocked test turns out to be wrong, as 
described in comments in the code.
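
To make the argument concrete, here is a much-simplified standalone sketch 
of that scheme; the types and names (FakeBufferDesc, etc.) are made up and 
only stand in for the real buffer-manager structures:

#include <stdio.h>
#include <stdlib.h>

typedef struct
{
    int relfile;               /* stand-ins for the buffer tag */
    int forknum;
    int blocknum;
    int flags;                 /* bit 0: "to write in this checkpoint" */
} FakeBufferDesc;

#define NBUFFERS 8
static FakeBufferDesc buf[NBUFFERS];

static int cmp_bufid(const void *pa, const void *pb)
{
    const FakeBufferDesc *a = &buf[*(const int *) pa];
    const FakeBufferDesc *b = &buf[*(const int *) pb];

    if (a->relfile != b->relfile)  return a->relfile - b->relfile;
    if (a->forknum != b->forknum)  return a->forknum - b->forknum;
    return a->blocknum - b->blocknum;
}

int main(void)
{
    int to_write[NBUFFERS];
    int n = 0;

    /* some dirty (flagged), some clean buffers */
    for (int i = 0; i < NBUFFERS; i++)
        buf[i] = (FakeBufferDesc) {.relfile = i % 3, .forknum = 0,
                                   .blocknum = NBUFFERS - i, .flags = i % 2};

    /* Pass 1: collect buf_ids of flagged buffers, reading without a lock. */
    for (int i = 0; i < NBUFFERS; i++)
        if (buf[i].flags & 1)
            to_write[n++] = i;

    /* Pass 2: advisory sort by tag, still without locking; a concurrent
     * change only makes the ordering suboptimal, never incorrect. */
    qsort(to_write, n, sizeof(int), cmp_bufid);

    /* Pass 3: write; in the real code the buffer header lock (LockBufHdr)
     * is taken here, the flag re-checked, and the buffer skipped if it was
     * cleaned or recycled since the scan. */
    for (int i = 0; i < n; i++)
    {
        FakeBufferDesc *b = &buf[to_write[i]];

        if (b->flags & 1)
            printf("write rel %d fork %d block %d\n",
                   b->relfile, b->forknum, b->blocknum);
    }
    return 0;
}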

> Besides, causing additional cacheline bouncing during the
> sorting process is a bad idea.

Hmmm. Avoiding that would mean multiplying the memory required by 3 or 4 
(storing buf_id, relation, forknum and offset instead of just buf_id), and 
I understood that memory was a concern.
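
To illustrate the trade-off with made-up field names and a 128 GB 
shared_buffers example (not a measurement):

#include <stdio.h>
#include <stdint.h>

/* Sorting plain buf_ids needs ~4 bytes per buffer but reads the buffer
 * descriptors during the sort; copying the sort keys out first avoids
 * that but needs roughly 3-4 times more memory. */

typedef int32_t BufIdOnly;          /* just the buffer index */

typedef struct
{
    int32_t  buf_id;
    uint32_t relfile;               /* stand-in for the relation identifier */
    int32_t  forknum;
    uint32_t blocknum;              /* offset within the fork */
} SortKeyCopy;

int main(void)
{
    const long nbuffers = 16L * 1024 * 1024;   /* ~128 GB of shared buffers */

    printf("buf_id only : %ld MiB\n", nbuffers * (long) sizeof(BufIdOnly) >> 20);
    printf("copied keys : %ld MiB\n", nbuffers * (long) sizeof(SortKeyCopy) >> 20);
    return 0;
}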

Moreover, once the sort process has fetched the cache lines containing the 
sorting data from the buffer descriptors, I think it should be pretty much 
okay; incidentally, those lines would probably already have been brought 
into cache by the scan that collects them. Also, I do not think that the 
sorting time for 128,000 buffers, and the possible cache misses, is a big 
issue, but I do not have measurements to back that up. I could try to 
collect some data about that.

-- 
Fabien.


