Re: Partitioned checkpointing - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Partitioned checkpointing
Msg-id 55F2F946.30403@2ndquadrant.com
In response to Re: Partitioned checkpointing  (Simon Riggs <simon@2ndQuadrant.com>)
Responses Re: Partitioned checkpointing  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers

On 09/11/2015 03:56 PM, Simon Riggs wrote:
>
> The idea to do a partial pass through shared buffers and only write a
> fraction of dirty buffers, then fsync them is a good one.
>
> The key point is that we spread out the fsyncs across the whole
> checkpoint period.

I doubt that's really what we want to do, as it defeats one of the 
purposes of spread checkpoints. With spread checkpoints, we write the 
data to the page cache and then let the OS actually write it to disk. 
This is handled by the kernel, which marks the dirty data as expired 
after some time (say, 30 seconds) and then flushes it to disk.

The goal is to have everything already written to disk by the time we 
call fsync at the beginning of the next checkpoint, so that the fsyncs 
are cheap and don't cause I/O issues.
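To make the timing concrete, here's a minimal sketch (hypothetical helper, not actual PostgreSQL code) of the pacing idea behind spread checkpoints: writes are spread over the first `target` fraction of the checkpoint interval, which leaves the remaining (1 - target) for the kernel to expire and flush the dirty page cache before the next checkpoint's fsyncs. The function name and parameters are illustrative assumptions; `target` mirrors the idea behind checkpoint_completion_target.

```c
#include <assert.h>

/* Hypothetical sketch, not PostgreSQL code: with a spread checkpoint,
 * dirty buffer i of n must be written no later than this offset (in
 * seconds) into the checkpoint interval. Writes finish at
 * interval_s * target, so the kernel has the remaining
 * interval_s * (1 - target) to flush the page cache in the background
 * before the next checkpoint calls fsync. */
static double write_deadline(int i, int n, double interval_s, double target)
{
    return interval_s * target * (double)(i + 1) / (double)n;
}
```

With a 300 s interval and target=0.5, the last write lands at 150 s, so the OS gets a full 150 s window before the next checkpoint's fsyncs.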

What you propose (spreading the fsyncs) changes that significantly, 
because it cuts the time the OS has to write the data to disk in the 
background down to 1/N of the checkpoint period. That's a significant 
change, and I'd bet it's for the worse.

>
> I think we should be writing out all buffers for a particular file
> in one pass, then issue one fsync per file. >1 fsyncs per file seems
> a bad idea.
>
> So we'd need logic like this
> 1. Run through shared buffers and analyze the files contained in there
> 2. Assign files to one of N batches so we can make N roughly equal sized
> mini-checkpoints
> 3. Make N passes through shared buffers, writing out files assigned to
> each batch as we go
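Step 2 of that scheme could be sketched like this (illustrative only, not actual PostgreSQL code): once the files seen in shared buffers are collected and sorted, assign them to N roughly equal-sized batches by contiguous runs, so each mini-checkpoint touches adjacent files. The function and its signature are my assumptions for illustration.

```c
#include <stddef.h>
#include <assert.h>

/* Illustrative only -- not PostgreSQL code. Given nfiles distinct files
 * (collected from one pass over shared buffers and sorted), assign file
 * k to one of nbatches roughly equal-sized mini-checkpoint batches.
 * Contiguous assignment keeps each batch's files adjacent in sort
 * order, so each pass writes one run of files and fsyncs each exactly
 * once. */
static int batch_for_file(size_t k, size_t nfiles, size_t nbatches)
{
    return (int)(k * nbatches / nfiles);
}
```

For 10 files in 3 batches this yields sizes 4/3/3, i.e. "roughly equal sized mini-checkpoints" with at most one file of imbalance.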

What I think might work better is keeping the write/fsync phases we 
have now, but instead of postponing the fsyncs until the next 
checkpoint, spreading them out after the writes. So with target=0.5 
we'd do the writes in the first half and the fsyncs in the other half. 
Of course, we should sort the data as you propose, and issue the 
fsyncs in the same order (so that the OS has time to write them to the 
devices).
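A minimal sketch of that split (hypothetical helper, not PostgreSQL code): writes are paced over the first `target` fraction of the interval, and the fsyncs, issued in the same sorted order as the writes, are paced over the remainder. Names and parameters are illustrative assumptions.

```c
#include <assert.h>

/* Hypothetical sketch of the write/fsync split described above, not
 * PostgreSQL code. Writes occupy [0, interval_s * target]; the fsyncs
 * are then spread over the remaining (1 - target) of the interval, in
 * the same sorted order as the writes. Returns the offset (seconds
 * into the interval) at which fsync j of m should be issued. */
static double fsync_deadline(int j, int m, double interval_s, double target)
{
    double writes_end = interval_s * target;         /* writes done by here */
    double fsync_span = interval_s * (1.0 - target); /* time left for fsyncs */
    return writes_end + fsync_span * (double)(j + 1) / (double)m;
}
```

With target=0.5 and a 300 s interval, the first of four fsyncs goes out at 187.5 s and the last at 300 s, so the earliest-written files get the most background-writeback time before their fsync.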

I wonder how much the original paper (written in 1996) is effectively 
obsoleted by spread checkpoints, but the benchmark results posted by 
Horikawa-san suggest there's a possible gain. But perhaps partitioning 
the checkpoints is not the best approach?

regards

-- 
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


