Re: checkpointer continuous flushing - Mailing list pgsql-hackers

From Fabien COELHO
Subject Re: checkpointer continuous flushing
Msg-id alpine.DEB.2.10.1510210727580.11852@sto
In response to Re: checkpointer continuous flushing  (Andres Freund <andres@anarazel.de>)
Responses Re: checkpointer continuous flushing  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Hello Andres,

>>> In my performance testing it showed that calling PerformFileFlush() only
>>> at segment boundaries and in CheckpointWriteDelay() can lead to rather
>>> spikey IO - not that surprisingly. The sync in CheckpointWriteDelay() is
>>> problematic because it only is triggered while on schedule, and not when
>>> behind.
>>
>> When behind, the PerformFileFlush should be called on segment
>> boundaries.
>
> That means it's flushing up to a gigabyte of data at once. Far too
> much.

Hmmm. I do not get it. There would not be gigabytes: there would only be 
as much as was written since the last sleep, about 100 ms ago, which is 
unlikely to amount to gigabytes? Even at a sustained 1 GB/s, 100 ms of 
writes is only about 100 MB.

> The implementation will pretty much always go behind schedule for some
> time. Since sync_file_range() doesn't flush in the foreground I don't
> think it's important to do the flushing in concert with sleeping.

For me it is important to avoid accumulating too large flushes, and that 
is the point of the call before sleeping.
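
To illustrate what I have in mind, here is a sketch; the helper name and
structure are mine, standing in for the patch's actual machinery, not
taken from it:

    #define _GNU_SOURCE
    #include <fcntl.h>          /* sync_file_range, SYNC_FILE_RANGE_WRITE */
    #include <unistd.h>         /* usleep, standing in for pg_usleep */

    /*
     * Sketch only: flush just the window written since the previous nap,
     * then sleep. The arguments describe the file range dirtied in the
     * last ~100 ms.
     */
    static void
    flush_window_then_nap(int fd, off_t start, off_t nbytes)
    {
        /* start asynchronous writeback for this window only */
        (void) sync_file_range(fd, start, nbytes, SYNC_FILE_RANGE_WRITE);

        usleep(100 * 1000);     /* ~100 ms, as in CheckpointWriteDelay */
    }

This way each flush covers at most about 100 ms worth of writes, which is
the accumulation limit I am after.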

>>> My testing seems to show that just adding a limit of 32 buffers to
>>> FileAsynchronousFlush() leads to markedly better results.
>>
>> Hmmm. 32 buffers means 256 KB, which is quite small.
>
> Why?

Because the point of sorting is to generate sequential writes, so that the 
HDD has a long run of contiguous blocks to write without moving its head, 
and 32 buffers (256 KB) is rather small for that.

> The aim is to not overwhelm the request queue - which is where the
> coalescing is done. And usually that's rather small.

That is an argument. How small, though? It seems to be 128 by default, so 
I'd rather have 128? Also, it can be changed, so maybe it should really be 
a GUC?
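
For reference, the current depth can be read back on Linux like the sketch
below; "sda" is just an example device name, and the value is a root
tunable, not something pg could rely on:

    #include <stdio.h>

    /*
     * Print the block-layer request queue depth; 128 is a common default.
     * Adjust "sda" for the actual disk.
     */
    int
    main(void)
    {
        FILE *f = fopen("/sys/block/sda/queue/nr_requests", "r");
        int   depth;

        if (f != NULL && fscanf(f, "%d", &depth) == 1)
            printf("nr_requests = %d\n", depth);
        if (f != NULL)
            fclose(f);
        return 0;
    }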

> If you flush much more sync_file_range starts to do work in the 
> foreground.

Argh, too bad. I would have hoped that the kernel would just deal with it 
in an asynchronous way; this is not an "fsync" call, just a flush advice.
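
To make the distinction explicit, my reading of sync_file_range(2) was
that only the WAIT_* variants are supposed to block; what you describe is
the plain WRITE flag itself waiting once the queue is congested:

    #define _GNU_SOURCE
    #include <fcntl.h>

    /*
     * Flag semantics per sync_file_range(2):
     *   SYNC_FILE_RANGE_WRITE       - initiate writeback, nominally non-blocking
     *   SYNC_FILE_RANGE_WAIT_BEFORE - wait for prior writeback before starting
     *   SYNC_FILE_RANGE_WAIT_AFTER  - wait for the writeback to complete
     */
    static void
    advise_flush(int fd, off_t offset, off_t nbytes)
    {
        /*
         * The "advice" call: should return quickly, although per your
         * observation it can still block once the request queue is full.
         */
        (void) sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
    }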

>>> I wonder if mmap() && msync(MS_ASYNC) isn't a better replacement for
>>> sync_file_range(SYNC_FILE_RANGE_WRITE) than posix_fadvise(DONTNEED). It
>>> might even be possible to later approximate that on windows using
>>> FlushViewOfFile().
>>
>> I'm not sure that mmap/msync can be used for this purpose, because there
>> is, it seems, no real control over where the file is mmapped.
>
> I'm not following? Why does it matter where a file is mapped?

Because it should be in shared buffers, which is where pg needs it? You 
probably do not want to mmap all pg data files into user space for a large 
database? As things stand, the OS keeps the data in memory if it has 
enough space, but if you went through mmap, that cache management would 
become pg's responsibility, if I understand mmap and your intentions 
correctly.

> I have had a friend (Christian Kruse, thanks!)  confirm that at least on
> OSX msync(MS_ASYNC) triggers writeback. A freebsd dev confirmed that
> that should be the case on freebsd too.

Good. My concern is how mmap could be used, though, not the flushing part.
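
For the record, the flushing side you describe would presumably look
something like the sketch below; the helper is mine, mapping the range
just to make the API shape concrete, and the mapping question is
precisely the part I wonder about for pg:

    #include <sys/mman.h>       /* mmap, msync, munmap */

    /*
     * Illustrative only: hint writeback on a file range via
     * msync(MS_ASYNC). The offset must be page-aligned for mmap;
     * error handling is minimal.
     */
    static int
    hint_writeback(int fd, off_t offset, size_t len)
    {
        void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, offset);

        if (p == MAP_FAILED)
            return -1;

        /* schedule writeback without waiting for it to complete */
        (void) msync(p, len, MS_ASYNC);

        return munmap(p, len);
    }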

>> Hmmm. I'll check. I'm still unconvinced that using a tree for a 2-3 element
>> set in most case is an improvement.
>
> Yes, it'll not matter that much in many cases. But I rather disliked the
> NextBufferToWrite() implementation, especially that it walks the array
> multiple times. And I did see setups with ~15 tablespaces.

ISTM that this is rather an argument for taking the tablespace into 
account in the sorting, not necessarily for a binary heap.
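
That is, something like the comparator below, with the tablespace first in
the key; the struct fields are approximate, not the actual buffer tag
layout:

    /*
     * Illustrative sort key: tablespace first, then relation, then block,
     * so that per-tablespace slices of the sorted array are contiguous.
     */
    typedef struct CkptSortItem
    {
        int         tsId;       /* tablespace */
        unsigned    relNode;    /* relation */
        unsigned    blockNum;   /* block within the relation */
    } CkptSortItem;

    static int
    ckpt_buforder_cmp(const void *pa, const void *pb)
    {
        const CkptSortItem *a = pa;
        const CkptSortItem *b = pb;

        if (a->tsId != b->tsId)
            return (a->tsId < b->tsId) ? -1 : 1;
        if (a->relNode != b->relNode)
            return (a->relNode < b->relNode) ? -1 : 1;
        if (a->blockNum != b->blockNum)
            return (a->blockNum < b->blockNum) ? -1 : 1;
        return 0;
    }

Fed to qsort(), this would give contiguous per-tablespace runs that can be
walked round-robin, without the repeated array scans you dislike in
NextBufferToWrite().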

>> I also noted this point, but I'm not sure how to have a better approach, so
>> I left it as it is. I tried 50 ms & 200 ms on some runs, without significant
>> effect on performance for the test I ran then. The point of having not too
>> small a value is that it provides some significant work to the IO subsystem
>> without overflowing it.
>
> I don't think that makes much sense. All a longer sleep achieves is
> creating a larger burst of writes afterwards. We should really sleep
> adaptively.

It sounds reasonable, but what would be the criterion?
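
If the criterion were the distance to the checkpoint schedule, maybe
something like the sketch below? The 10% cutoff and the linear scaling are
entirely mine, just to make "adaptively" concrete:

    #include <unistd.h>

    /*
     * Illustrative adaptive nap: sleep the full 100 ms when on or ahead
     * of schedule, shrink the nap in proportion to the lag, and skip it
     * when far behind. Both thresholds are arbitrary choices.
     */
    static void
    adaptive_nap(double progress, double target_progress)
    {
        double lag    = target_progress - progress; /* > 0 means behind */
        long   nap_us = 100L * 1000L;               /* base nap: 100 ms */

        if (lag > 0.10)
            return;                                 /* far behind: no nap */

        if (lag > 0.0)
            nap_us = (long) (nap_us * (1.0 - lag / 0.10));

        usleep((useconds_t) nap_us);
    }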

-- 
Fabien.


