Re: Design proposal: fsync absorb linear slider - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: Design proposal: fsync absorb linear slider
Date:
Msg-id: CA+TgmoavXUuiWAowqryoMcmNSPWeO=5eyvcYT9bzHqp4=2-RdA@mail.gmail.com
In response to: Re: Design proposal: fsync absorb linear slider  (Greg Smith <greg@2ndQuadrant.com>)
List: pgsql-hackers
On Tue, Jul 23, 2013 at 12:13 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> On 7/23/13 10:56 AM, Robert Haas wrote:
>> On Mon, Jul 22, 2013 at 11:48 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>>>
>>> We know that a 1GB relation segment can take a really long time to write
>>> out.  That could include up to 128 changed 8K pages, and we allow all of
>>> them to get dirty before any are forced to disk with fsync.
>>
>> By my count, it can include up to 131,072 changed 8K pages.
>
> Even better!  I can pinpoint exactly what time last night I got tired enough
> to start making trivial mistakes.  Everywhere I said 128 it's actually
> 131,072, which just changes the range of the GUC I proposed.
>
> Getting the number right really highlights just how bad the current
> situation is.  Would you expect the database to dump up to 128K writes into
> a file and then have low latency when it's flushed to disk with fsync?  Of
> course not.  But that's the job the checkpointer process is trying to do
> right now.  And it's doing it blind--it has no idea how many dirty pages
> might have accumulated before it started.
>
> I'm not exactly sure how best to use the information collected.  fsync every
> N writes is one approach.  Another is to use accumulated writes to predict
> how long fsync on that relation should take.  Whenever I tried to spread
> fsync calls out before, the scale of the piled up writes from backends was
> the input I really wanted available.  The segment write count gives an
> alternate way to sort the blocks too, you might start with the heaviest hit
> ones.
>
> In all these cases, the fundamental I keep coming back to is wanting to cue
> off past write statistics.  If you want to predict relative I/O delay times
> with any hope of accuracy, you have to start the checkpoint knowing
> something about the backend and background writer activity since the last
> one.

So, I don't think this is a bad idea; in fact, I think it'd be a good
thing to explore.  The hard part is likely to be convincing ourselves
of anything about how well or poorly it works on arbitrary hardware
under arbitrary workloads, but we've got to keep trying things until
we find something that works well, so why not this?
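
To make that concrete, here is a rough sketch of the "fsync every N
writes" counter you describe, written in the spirit of the backend code
but with invented names and none of the real fsync-queue machinery (the
1 GB / 8 kB arithmetic is where the 131,072 figure comes from):

/*
 * Illustrative sketch only -- not the real md.c / checkpointer.c code.
 * Count the OS writes absorbed for each 1GB segment and issue an fsync
 * once the count passes a configurable threshold.
 */
#include <stdio.h>
#include <unistd.h>

#define PAGES_PER_SEGMENT   131072      /* 1 GB / 8 kB pages */

typedef struct SegmentWriteState
{
    int     fd;                 /* open descriptor for the segment file */
    int     writes_since_sync;  /* writes absorbed since the last fsync */
} SegmentWriteState;

/* stand-in for the proposed GUC; the name is invented */
static int  absorb_fsync_after_writes = 4096;

/* Call this each time a write to the segment is absorbed. */
static void
absorb_write(SegmentWriteState *seg)
{
    seg->writes_since_sync++;

    if (seg->writes_since_sync >= absorb_fsync_after_writes)
    {
        if (fsync(seg->fd) != 0)    /* the real code would queue this */
            perror("fsync");
        seg->writes_since_sync = 0;
    }
}

The same per-segment counter is also the statistic you'd want for
predicting how long an fsync should take, or for sorting segments so
the heaviest-hit ones go first.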

One general observation is that there are two bad things that happen
when we checkpoint.  One is that we force all of the data in RAM out
to disk, and the other is that we start doing lots of FPIs (full-page
images).  Both of these things harm throughput.  Your proposal allows
the user to make
the first of those behaviors more frequent without making the second
one more frequent.  That idea seems promising, and it also seems to
admit of many variations.  For example, instead of issuing an fsync
after every N OS writes to a particular file, we could fsync the file
with the most writes every K seconds.  That way, if the system has
busy and idle periods, we'll effectively "catch up on our fsyncs" when
the system isn't that busy, and we won't bunch them up too much if
there's a sudden surge of activity.
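
A sketch of what that variant might look like, with the same caveats as
above (invented names, no real fsync-queue handling):

/*
 * Illustrative sketch only: every K seconds, fsync whichever segment
 * has absorbed the most writes since it was last synced.
 */
#include <stdio.h>
#include <unistd.h>

typedef struct SegmentWriteState
{
    int     fd;                 /* open descriptor for the segment file */
    int     writes_since_sync;  /* writes absorbed since the last fsync */
} SegmentWriteState;

/* Call this from the checkpointer's loop roughly every K seconds. */
static void
sync_busiest_segment(SegmentWriteState *segs, int nsegs)
{
    SegmentWriteState *busiest = NULL;
    int         i;

    for (i = 0; i < nsegs; i++)
    {
        if (segs[i].writes_since_sync == 0)
            continue;
        if (busiest == NULL ||
            segs[i].writes_since_sync > busiest->writes_since_sync)
            busiest = &segs[i];
    }

    if (busiest != NULL)
    {
        if (fsync(busiest->fd) != 0)
            perror("fsync");
        busiest->writes_since_sync = 0;
    }
}

During an idle stretch this drains one file per tick until every counter
is back to zero, which is the "catch up on our fsyncs" behavior; during
a surge it still issues at most one extra fsync every K seconds.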

Now, that's just a shot in the dark, and there might be reasons why it's
terrible.  I offer it mainly as food for thought: the triggering event
for the extra fsyncs could be chosen by any number of different
algorithms, and as you hack through this it might be worth trying a few
different possibilities.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


