
From Fabien COELHO
Subject Re: checkpointer continuous flushing
Date
Msg-id alpine.DEB.2.10.1510221717000.22122@sto
In response to Re: checkpointer continuous flushing  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Hello Andres,

>> there would be as much as was written since the last sleep, about 100
>> ms ago, which is not likely to be gigabytes?
>
> In many cases we don't sleep all that frequently - after one 100ms sleep
> we're already behind a lot.

I think that "being behind" is not a problem as such, it is really the way 
the scheduler has been designed and works, by keeping pace with time & 
wall progress by little bursts of writes. If you reduce the sleep time a 
lot then it would end up having writes interleaved with small sleeps, but 
then this would be bad for performance has the OS would loose the ability 
to write much data sequentially on the disk.

It does not mean that the default 100 ms is a good figure, but "being
behind" is a feature, not an issue as such.

> And even so, it's pretty easy to get into checkpoint scenarios with ~500 
> mbyte/s as a writeout rate.

Hmmmm. Not with my hardware:-)

> Only issuing a sync_file_range() 10 times for that is obviously 
> problematic.

Hmmm. Then it should depend on the expected write capacity of the 
underlying disks...

>>> The implementation pretty always will go behind schedule for some
>>> time. Since sync_file_range() doesn't flush in the foreground I don't
>>> think it's important to do the flushing in concert with sleeping.
>>
>> For me it is important to avoid accumulating too large flushes, and that is
>> the point of the call before sleeping.
>
> I don't follow this argument. It's important to avoid large flushes,
> therefore we potentially allow large flushes to accumulate?

On my simple test hardware the flushes are not large, I think, so the 
problem does not arise. Maybe I should check.

>>>>> My testing seems to show that just adding a limit of 32 buffers to
>>>>> FileAsynchronousFlush() leads to markedly better results.
>>>>
>>>> Hmmm. 32 buffers means 256 KB, which is quite small.
>>>
>>> Why?
>>
>> Because the point of sorting is to generate sequential writes so that the
>> HDD has a lot of aligned stuff to write without moving the head, and 32 is
>> rather small for that.
>
> A sync_file_range(SYNC_FILE_RANGE_WRITE) doesn't synchronously write
> data back. It just puts it into the write queue.

Yes.
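
For the record, the call under discussion is just the following
(a standalone illustration, not the patch code; the 32 * 8 kB range
matches the proposed limit):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>

int
main(int argc, char **argv)
{
    int     fd;

    if (argc != 2)
    {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }

    fd = open(argv[1], O_WRONLY);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /* ask the kernel to start writing back the first 32 * 8 kB of the file;
     * the call normally returns without waiting for the I/O to complete */
    if (sync_file_range(fd, 0, 32 * 8192, SYNC_FILE_RANGE_WRITE) < 0)
    {
        perror("sync_file_range");
        return 1;
    }
    return 0;
}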

> You can have merging between IOs from either side.  But more importantly 
> you can't merge that many requests together anyway.

Probably.

>>> The aim is to not overwhelm the request queue - which is where the
>>> coalescing is done. And usually that's rather small.
>>
>> That is an argument. How small, though? It seems to be 128 by default, so
>> I'd rather have 128? Also, it can be changed, so maybe it should really be a
>> guc?
>
> I couldn't see any benefits above (and below) 32 on a 20 drive system,

So that is one kind of (big) hardware. Assuming that pages are contiguous,
how much is written to each disk depends on the RAID type and the stripe
size, and when it is really written depends on the various caches (in the
RAID HW card if any, on the disk, ...), so whether 32 at the OS level is
the right size is pretty unclear to me. I would have said the larger the
better, but indeed you should avoid blocking.

> so I doubt it's worthwhile. It's actually good for interactivity to
> allow other requests into the queue concurrently - otherwise other
> reads/writes will obviously have a higher latency...

Sure. Now on my tests, with my (old & little) hardware, it seemed quite 
smooth. What I'm driving at is that what is good may be relative and 
depend on the underlying hardware, which makes choosing the right 
parameter not obvious.

>>> If you flush much more sync_file_range starts to do work in the
>>> foreground.
>>
>> Argh, too bad. I would have hoped that the OS would just deal with it
>> in an asynchronous way,
>
> It's even in the man page:
> "Note  that  even  this  may  block if you attempt to write more than
> request queue size."

Hmmm. What about choosing "request queue size * 0.5", then?
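
For instance, on Linux the queue depth can be read from sysfs, something
like the sketch below (the device name "sda" and the fallback default are
just examples):

#include <stdio.h>

int
main(void)
{
    FILE   *f = fopen("/sys/block/sda/queue/nr_requests", "r");
    int     nr_requests = 128;      /* common default if the file is absent */

    if (f != NULL)
    {
        if (fscanf(f, "%d", &nr_requests) != 1)
            nr_requests = 128;
        fclose(f);
    }

    printf("flush at most %d buffers per sync_file_range call\n",
           nr_requests / 2);
    return 0;
}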

>> Because it should be in shared buffers where pg needs it?
>
> Huh? I'm just suggesting p = mmap(fd, offset, bytes);msync(p, bytes);munmap(p);
> instead of sync_file_range().

I think that I do not really understand how it may work, but possibly it 
could.
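
If I read the suggestion correctly, it would be something like the
following sketch (untested, hypothetical; whether MS_ASYNC is the intended
flag is my assumption):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>

int
main(int argc, char **argv)
{
    size_t  bytes = 32 * 8192;      /* same 256 kB range as above */
    int     fd;
    void   *p;

    if (argc != 2)
    {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }

    fd = open(argv[1], O_RDWR);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /* map the dirty range, schedule its writeback, and unmap again */
    p = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED || msync(p, bytes, MS_ASYNC) < 0)
    {
        perror("mmap/msync");
        return 1;
    }
    munmap(p, bytes);
    return 0;
}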

>> ISTM that it is rather an argument for taking the tablespace into the
>> sorting, not necessarily for a binary heap.
>
> I don't understand your problem with that. The heap specific code is
> small, smaller than your NextBufferToWrite() implementation?

You have not yet posted the updated version of the patch.

The complexity of the round-robin scan on the array is O(1) and very few 
instructions, plus a stop condition which I think is mostly true if the 
writes are balanced between tablespaces, and there is no dynamic 
allocation in the data structure (it is an array). The binary heap is 
O(log(n)); there are probably dynamic allocations and frees when 
extracting/inserting elements, function calls to rebalance the tree, and 
so on. Ok, "n" is expected to be small.

So basically, for me it is not obviously superior to the previous version. 
Now I'm also tired, so if it works reasonably I'll be fine with it.
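
To be concrete, the kind of O(1) selection I have in mind is roughly the
following (hypothetical names and types, not the actual NextBufferToWrite
code): each tablespace keeps its own cursor into the sorted buffer list,
and we advance a rotating index over a fixed array of tablespaces.

typedef struct TableSpaceScan
{
    int     cursor;         /* next sorted buffer index for this tablespace */
    int     nbuffers;       /* number of buffers to write for it */
} TableSpaceScan;

/* advance a rotating index over a fixed array of tablespaces;
 * no allocation, no rebalancing, a handful of instructions per call */
static int
next_tablespace(TableSpaceScan *spaces, int nspaces, int *round_robin)
{
    int     tries;

    for (tries = 0; tries < nspaces; tries++)
    {
        int     ts = *round_robin;

        *round_robin = (*round_robin + 1) % nspaces;
        if (spaces[ts].cursor < spaces[ts].nbuffers)
            return ts;      /* this tablespace still has buffers to write */
    }
    return -1;              /* all tablespaces are done */
}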

> [... code extract ...]

>>> I don't think that makes much sense. All a longer sleep achieves is
>>> creating a larger burst of writes afterwards. We should really sleep
>>> adaptively.
>>
>> It sounds reasonable, but what would be the criterion?
>
> What IsCheckpointOnSchedule() does is essentially to calculate progress
> for two things:
> 1) Are we on schedule based on WAL segments until CheckPointSegments
>   (computed via max_wal_size these days). I.e. is the percentage of
>   used up WAL bigger than the percentage of written out buffers.
>
> 2) Are we on schedule based on checkpoint_timeout. I.e. is the
>   percentage of checkpoint_timeout already passed bigger than the
>   percentage of buffers written out.

> So the trick is just to compute the number of work items (e.g. buffers
> to write out) and divide the remaining time by it. That's how long you
> can sleep.

See discussion above. ISTM that the "bursts" are a useful feature of the 
checkpoint scheduler, especially with sorted buffers & flushes. You want 
to provide grouped writes that will be easily written to disk together. 
You do not want page writes issued one by one, interleaved with small 
sleeps.
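
Just to make sure I understand the proposal, it amounts to something like
this per-item sleep (hypothetical names), which is precisely the
interleaving I would like to avoid:

#include <stdint.h>

/* sleep time between two writes: remaining time divided by remaining work */
static int64_t
adaptive_sleep_us(int64_t time_left_us, int buffers_left)
{
    if (buffers_left <= 0 || time_left_us <= 0)
        return 0;
    return time_left_us / buffers_left;
}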

> It's slightly trickier for WAL and I'm not sure it's equally
> important. But even there it shouldn't be too hard to calculate the
> amount of time till we're behind on schedule and only sleep that long.

The scheduler stops writing as soon as it has caught up with the schedule, 
so the remaining time should be very small, but if you do that you would 
end up writing pages one by one, which is not desirable at all.

> I'm running benchmarks right now, they'll take a bit to run to
> completion.

Good.

I'm looking forward to having a look at the updated version of the patch.

-- 
Fabien.


