
From: Andres Freund
Subject: Re: checkpointer continuous flushing
Date:
Msg-id: 20151021133503.GO3485@alap3.anarazel.de
In response to: Re: checkpointer continuous flushing (Fabien COELHO <coelho@cri.ensmp.fr>)
Responses: Re: checkpointer continuous flushing (Fabien COELHO <coelho@cri.ensmp.fr>)
List: pgsql-hackers
On 2015-10-21 07:49:23 +0200, Fabien COELHO wrote:
>
> Hello Andres,
>
> >>>In my performance testing it showed that calling PerformFileFlush() only
> >>>at segment boundaries and in CheckpointWriteDelay() can lead to rather
> >>>spikey IO - not that surprisingly. The sync in CheckpointWriteDelay() is
> >>>problematic because it only is triggered while on schedule, and not when
> >>>behind.
> >>
> >>When behind, the PerformFileFlush should be called on segment
> >>boundaries.
> >
> >That means it's flushing up to a gigabyte of data at once. Far too
> >much.
>
> Hmmm. I do not get it. There would not be gigabytes,

I said 'up to a gigabyte', not gigabytes. But it actually can be more
than one if you're unlucky.

> there would be as much as was written since the last sleep, about 100
> ms ago, which is not likely to be gigabytes?

In many cases we don't sleep all that frequently - after one 100ms sleep
we're already behind a lot. And even so, it's pretty easy to get into
checkpoint scenarios with ~500 MB/s as a writeout rate. Only issuing
a sync_file_range() 10 times for that is obviously problematic.

> >The implementation pretty always will go behind schedule for some
> >time. Since sync_file_range() doesn't flush in the foreground I don't
> >think it's important to do the flushing in concert with sleeping.
>
> For me it is important to avoid accumulating too large flushes, and that is
> the point of the call before sleeping.

I don't follow this argument. It's important to avoid large flushes,
therefore we potentially allow large flushes to accumulate?

> >>>My testing seems to show that just adding a limit of 32 buffers to
> >>>FileAsynchronousFlush() leads to markedly better results.
> >>
> >>Hmmm. 32 buffers means 256 KB, which is quite small.
> >
> >Why?
>
> Because the point of sorting is to generate sequential writes so that the
> HDD has a lot of aligned stuff to write without moving the head, and 32 is
> rather small for that.

A sync_file_range(SYNC_FILE_RANGE_WRITE) doesn't synchronously write
data back. It just puts it into the write queue. You can have merging
between IOs from either side.  But more importantly you can't merge that
many requests together anyway.
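
To make the cap concrete, roughly something like this (all of these names
- FlushContext, record_write, flush_pending, FLUSH_LIMIT_BUFFERS - are
invented for illustration, not the patch's actual API, and the sketch
assumes the tracked writes are contiguous within one file):

#define _GNU_SOURCE
#include <fcntl.h>              /* sync_file_range(), Linux-specific */

#define BLCKSZ 8192             /* assuming the default 8kB block size */
#define FLUSH_LIMIT_BUFFERS 32  /* 32 * 8kB = 256kB per flush request */

/* hypothetical per-file flush accumulator, not the patch's real struct */
typedef struct FlushContext
{
    int     fd;                 /* file the buffers were written to */
    off_t   offset;             /* start of the pending range */
    int     nbuffers;           /* buffers written since the last flush */
} FlushContext;

/*
 * Hand the accumulated range to the kernel's write queue.
 * SYNC_FILE_RANGE_WRITE only initiates writeback; it does not wait for
 * the data to reach disk, so this normally returns quickly.
 */
static void
flush_pending(FlushContext *ctx)
{
    if (ctx->nbuffers > 0)
    {
        (void) sync_file_range(ctx->fd, ctx->offset,
                               (off_t) ctx->nbuffers * BLCKSZ,
                               SYNC_FILE_RANGE_WRITE);
        ctx->nbuffers = 0;
    }
}

/* Called after each buffer write; flushes once the cap is reached. */
static void
record_write(FlushContext *ctx, off_t offset)
{
    if (ctx->nbuffers == 0)
        ctx->offset = offset;
    ctx->nbuffers++;
    if (ctx->nbuffers >= FLUSH_LIMIT_BUFFERS)
        flush_pending(ctx);
}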

> >The aim is to not overwhelm the request queue - which is where the
> >coalescing is done. And usually that's rather small.
>
> That is an argument. How small, though? It seems to be 128 by default, so
> I'd rather have 128? Also, it can be changed, so maybe it should really be a
> guc?

I couldn't see any benefits above (and below) 32 on a 20 drive system,
so I doubt it's worthwhile. It's actually good for interactivity to
allow other requests into the queue concurrently - otherwise other
reads/writes will obviously have a higher latency...

> >If you flush much more sync_file_range starts to do work in the
> >foreground.
>
> Argh, too bad. I would have hoped that it would just deal with it in an
> asynchronous way,

It's even in the man page:
"Note  that  even  this  may  block if you attempt to write more than
request queue size."

> this is not a "fsync" call, just a flush advise.

sync_file_range isn't fadvise().

> Because it should be in shared buffers where pg needs it?

Huh? I'm just suggesting p = mmap(fd, offset, bytes); msync(p, bytes); munmap(p);
instead of sync_file_range().
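
Spelled out, that alternative would look roughly like this (a sketch only;
the wrapper name is invented, the offset has to be page-aligned, and how
eagerly msync(MS_ASYNC) actually starts writeback is OS-dependent):

#include <sys/mman.h>

/*
 * Hypothetical portable stand-in for sync_file_range(): map the
 * already-written range, request asynchronous writeback, unmap again.
 * 'offset' must be page-aligned (BLCKSZ-aligned offsets are, on the
 * usual 4kB page size), and the fd must have been opened read/write
 * for the PROT_WRITE shared mapping to succeed.
 */
static int
flush_range_msync(int fd, off_t offset, size_t nbytes)
{
    void   *p = mmap(NULL, nbytes, PROT_READ | PROT_WRITE, MAP_SHARED,
                     fd, offset);

    if (p == MAP_FAILED)
        return -1;
    if (msync(p, nbytes, MS_ASYNC) != 0)
    {
        (void) munmap(p, nbytes);
        return -1;
    }
    return munmap(p, nbytes);
}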

> >>Hmmm. I'll check. I'm still unconvinced that using a tree for a 2-3 element
> >>set in most cases is an improvement.
> >
> >Yes, it'll not matter that much in many cases. But I rather disliked the
> >NextBufferToWrite() implementation, especially that it walks the array
> >multiple times. And I did see setups with ~15 tablespaces.
>
> ISTM that it is rather an argument for taking the tablespace into the
> sorting, not necessarily for a binary heap.

I don't understand your problem with that. The heap specific code is
small, smaller than your NextBufferToWrite() implementation?
   ts_heap = binaryheap_allocate(nb_spaces,
                                 ts_progress_cmp,
                                 NULL);

   spcContext = (FileFlushContext *)
       palloc(sizeof(FileFlushContext) * nb_spaces);

   for (i = 0; i < nb_spaces; i++)
   {
       TableSpaceCheckpointStatus *spc = &spcStatus[i];

       spc->progress_slice = ((float8) num_to_write) / (float8) spc->num_to_write;

       ResetFileFlushContext(&spcContext[i]);
       spc->flushContext = &spcContext[i];

       binaryheap_add_unordered(ts_heap, PointerGetDatum(&spcStatus[i]));
   }
   binaryheap_build(ts_heap);

and then

   while (!binaryheap_empty(ts_heap))
   {
       TableSpaceCheckpointStatus *ts = (TableSpaceCheckpointStatus *)
           DatumGetPointer(binaryheap_first(ts_heap));
...
       ts->progress += ts->progress_slice;
       ts->num_written++;
...
       if (ts->num_written == ts->num_to_write)
       {
...
           binaryheap_remove_first(ts_heap);
       }
       else
       {
           /* update heap with the new progress */
           binaryheap_replace_first(ts_heap, PointerGetDatum(ts));
       }
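
For context, binaryheap keeps the element that compares greatest at the
root, so the comparator has to invert the ordering to always surface the
tablespace that is furthest behind. A sketch of what ts_progress_cmp could
look like (the actual comparator isn't shown in this mail, so the details
are assumed):

#include "postgres.h"
#include "lib/binaryheap.h"

/*
 * Sketch only: rank the tablespace with the *least* progress as
 * "greatest", so binaryheap_first() always returns the one most behind.
 * TableSpaceCheckpointStatus is the patch's struct used above.
 */
static int
ts_progress_cmp(Datum a, Datum b, void *arg)
{
    TableSpaceCheckpointStatus *sa = (TableSpaceCheckpointStatus *) DatumGetPointer(a);
    TableSpaceCheckpointStatus *sb = (TableSpaceCheckpointStatus *) DatumGetPointer(b);

    if (sa->progress < sb->progress)
        return 1;               /* a is more behind, treat it as greater */
    else if (sa->progress > sb->progress)
        return -1;
    else
        return 0;
}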
 

> >>I also noted this point, but I'm not sure how to have a better approach, so
> >>I left it as it is. I tried 50 ms & 200 ms on some runs, without significant
> >>effect on performance for the test I ran then. The point of having not too
> >>small a value is that it provides some significant work to the IO subsystem
> >>without overflowing it.
> >
> >I don't think that makes much sense. All a longer sleep achieves is
> >creating a larger burst of writes afterwards. We should really sleep
> >adaptively.
>
> It sounds reasonable, but what would be the criterion?

What IsCheckpointOnSchedule() does is essentially to calculate progress
for two things:
1) Are we on schedule based on WAL segments until CheckPointSegments
   (computed via max_wal_size these days)? I.e. is the percentage of
   used up WAL bigger than the percentage of written out buffers?

2) Are we on schedule based on checkpoint_timeout? I.e. is the
   percentage of checkpoint_timeout already passed bigger than the
   percentage of buffers written out?
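
In rough terms, something like this (helper names invented; the real logic
lives in IsCheckpointOnSchedule() in checkpointer.c and works off a
progress fraction rather than these exact helpers):

#include <stdbool.h>

extern double wal_used_since_checkpoint(void);      /* invented helper */
extern double wal_allowed_per_checkpoint(void);     /* invented helper */
extern double secs_since_checkpoint_start(void);    /* invented helper */
extern int    CheckPointTimeout;                    /* checkpoint_timeout GUC, seconds */

/* Sketch: 'progress' is the fraction of checkpoint work already done. */
static bool
checkpoint_on_schedule(double progress)
{
    double  wal_frac;           /* fraction of allowed WAL already used up */
    double  time_frac;          /* fraction of checkpoint_timeout elapsed */

    wal_frac = wal_used_since_checkpoint() / wal_allowed_per_checkpoint();
    if (wal_frac > progress)
        return false;           /* behind relative to WAL consumption */

    time_frac = secs_since_checkpoint_start() / (double) CheckPointTimeout;
    if (time_frac > progress)
        return false;           /* behind relative to elapsed time */

    return true;
}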
 

So the trick is just to compute the number of work items (e.g. buffers
to write out) and divide the remaining time by it. That's how long you
can sleep.

It's slightly trickier for WAL and I'm not sure it's equally
important. But even there it shouldn't be too hard to calculate the
amount of time till we're behind on schedule and only sleep that long.
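
As a sketch of that (names invented, not actual patch code):

extern double target_checkpoint_duration(void);     /* invented: timeout * completion target */
extern double secs_since_checkpoint_start(void);    /* invented helper */

/*
 * Adaptive sleep sketch: instead of a fixed 100ms nap, sleep only for
 * the time we can afford per remaining work item (e.g. buffer to write).
 */
static long
checkpoint_sleep_usec(int items_remaining)
{
    double  secs_left;

    if (items_remaining <= 0)
        return 0;

    secs_left = target_checkpoint_duration() - secs_since_checkpoint_start();
    if (secs_left <= 0.0)
        return 0;               /* already behind schedule: don't sleep */

    return (long) (secs_left * 1000000.0 / items_remaining);
}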


I'm running benchmarks right now, they'll take a bit to run to
completion.

Greetings,

Andres Freund


