Re: Partitioned checkpointing - Mailing list pgsql-hackers

From Takashi Horikawa
Subject Re: Partitioned checkpointing
Msg-id 73FA3881462C614096F815F75628AFCD03558B38@BPXM01GP.gisp.nec.co.jp
In response to Re: Partitioned checkpointing  (Simon Riggs <simon@2ndQuadrant.com>)
Responses Re: Partitioned checkpointing  (Takashi Horikawa <t-horikawa@aj.jp.nec.com>)
List pgsql-hackers
Hi,

>     I understand that what this patch does is cutting the checkpoint
> of buffers in 16 partitions, each addressing 1/16 of buffers, and each with
> its own wal-log entry, pacing, fsync and so on.
Right.
However,
> The key point is that we spread out the fsyncs across the whole checkpoint
> period.
this is not the key point of 'Partitioned checkpointing,' I think.
The original purpose is to mitigate the full-page-write (FPW) rush that
occurs immediately after the beginning of each checkpoint.
The amount of FPW at each checkpoint is reduced to 1/16 by
'Partitioned checkpointing.'
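
As a rough illustration of the arithmetic (the configuration is hypothetical,
chosen only to make the numbers concrete): with shared_buffers = 1GB there
are 1GB / 8kB = 131072 buffers. A plain checkpoint turns all of them into
FPW candidates at the same instant, whereas with 16 partitions each
mini-checkpoint exposes only about 131072 / 16 = 8192 buffers to the
first-touch FPW penalty, so the WAL spike right after the redo pointer
advances is correspondingly smaller.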

>     This method interacts with the current proposal to improve the
> checkpointer behavior by avoiding random I/Os, but it could be combined.
I agree.

> Splitting with N=16 does nothing to guarantee the partitions are equally
> sized, so there would likely be an imbalance that would reduce the
> effectiveness of the patch.
That may be right.
However, the current method was designed to split the buffers so as to
balance the load as equally as possible; the current patch splits the
buffers as follows:
---
1st round: b[0], b[p], b[2p], … b[(n-1)p]
2nd round: b[1], b[p+1], b[2p+1], … b[(n-1)p+1]
…
p-th round: b[p-1], b[p+(p-1)], b[2p+(p-1)], … b[(n-1)p+(p-1)]
---
where N is the number of buffers,
p is the number of partitions, and n = (N / p).
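
As a minimal sketch of this striped split (all names below, such as
NUM_CHECKPOINT_PARTITIONS, BufferGetPartition and CheckPointOnePartition,
are mine for illustration only and are not the identifiers used in the
patch):
---
#include "postgres.h"

#define NUM_CHECKPOINT_PARTITIONS 16    /* p in the description above */

extern int  NBuffers;                   /* PostgreSQL global: N, pool size */

/* Buffer b[i] belongs to partition (i mod p), so one mini-checkpoint
 * sweeps every p-th buffer: b[k], b[p + k], b[2p + k], ...           */
static inline int
BufferGetPartition(int buf_id)
{
    return buf_id % NUM_CHECKPOINT_PARTITIONS;
}

/* Write out the dirty buffers belonging to a single partition. */
static void
CheckPointOnePartition(int partition)
{
    int         buf_id;

    for (buf_id = partition; buf_id < NBuffers;
         buf_id += NUM_CHECKPOINT_PARTITIONS)
    {
        /* flush the buffer here if it is dirty, then pace/fsync */
    }
}
---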

It would be extremely unbalanced if the buffers were divided as follows:
---
1st round: b[0], b[1], b[2], … b[n-1]
2nd round: b[n], b[n+1], b[n+2], … b[2n-1]
…
p-th round: b[(p-1)n], b[(p-1)n+1], b[(p-1)n+2], … b[(p-1)n+(n-1)]
---
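
For contrast, that contiguous division corresponds to something like the
following (again only an illustration); presumably the concern is that dirty
buffers tend to cluster in contiguous ranges of the pool, so one such
partition could end up with most of the checkpoint work:
---
/* Contiguous split for contrast: partition k would own the range
 * b[k*n] .. b[k*n + n - 1], where n = N / p.                        */
static inline int
BufferGetPartitionContiguous(int buf_id, int n)
{
    return buf_id / n;
}
---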


I'm afraid I may be missing the point, but
> 2.
> Assign files to one of N batches so we can make N roughly equal sized
> mini-checkpoints
Splitting the buffers along file boundaries would make the FPW-related
processing (in xlog.c and xloginsert.c) intolerably complicated, because
'Partitioned checkpointing' is tightly coupled to the decision of whether
a given buffer needs a FPW at the time the xlog record is inserted.
# 'partition id = buffer id % number of partitions' is fairly simple.
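
To illustrate why the modulo rule keeps that decision simple, here is a
rough sketch of the idea, assuming each partition carries its own redo
pointer; the array and function names are my own illustration, not the
patch's actual code:
---
#include "postgres.h"
#include "access/xlogdefs.h"            /* XLogRecPtr */

#define NUM_CHECKPOINT_PARTITIONS 16

/* Hypothetical: one redo pointer per partition, advanced when that
 * partition's mini-checkpoint begins (stock PostgreSQL keeps a single
 * RedoRecPtr for the whole buffer pool).                             */
static XLogRecPtr PartitionRedoRecPtr[NUM_CHECKPOINT_PARTITIONS];

static bool
BufferNeedsFullPageWrite(int buf_id, XLogRecPtr page_lsn)
{
    int         part = buf_id % NUM_CHECKPOINT_PARTITIONS;

    /* A full-page image is required only if the page has not been
     * WAL-logged since this partition's last mini-checkpoint began. */
    return page_lsn <= PartitionRedoRecPtr[part];
}
---
With this shape, the insertion path only needs the extra modulo and array
lookup at the point where it currently compares the page LSN against the
single RedoRecPtr.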

Best regards.
--
Takashi Horikawa
NEC Corporation
Knowledge Discovery Research Laboratories



> -----Original Message-----
> From: Simon Riggs [mailto:simon@2ndQuadrant.com]
> Sent: Friday, September 11, 2015 10:57 PM
> To: Fabien COELHO
> Cc: Horikawa Takashi(堀川 隆); pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Partitioned checkpointing
>
> On 11 September 2015 at 09:07, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
>
>
>     Some general comments :
>
>
>
> Thanks for the summary Fabien.
>
>
>     I understand that what this patch does is cutting the checkpoint
> of buffers in 16 partitions, each addressing 1/16 of buffers, and each with
> its own wal-log entry, pacing, fsync and so on.
>
>     I'm not sure why it would be much better, although I agree that
> it may have some small positive influence on performance, but I'm afraid
> it may also degrade performance in some conditions. So I think that maybe
> a better understanding of why there is a better performance and focus on
> that could help obtain a more systematic gain.
>
>
>
> I think its a good idea to partition the checkpoint, but not doing it this
> way.
>
> Splitting with N=16 does nothing to guarantee the partitions are equally
> sized, so there would likely be an imbalance that would reduce the
> effectiveness of the patch.
>
>
>     This method interacts with the current proposal to improve the
> checkpointer behavior by avoiding random I/Os, but it could be combined.
>
>     I'm wondering whether the benefit you see are linked to the file
> flushing behavior induced by fsyncing more often, in which case it is quite
> close the "flushing" part of the current "checkpoint continuous flushing"
> patch, and could be redundant/less efficient that what is done there,
> especially as test have shown that the effect of flushing is *much* better
> on sorted buffers.
>
>     Another proposal around, suggested by Andres Freund I think, is
> that checkpoint could fsync files while checkpointing and not wait for the
> end of the checkpoint. I think that it may also be one of the reason why
> your patch does bring benefit, but Andres approach would be more systematic,
> because there would be no need to fsync files several time (basically your
> patch issues 16 fsync per file). This suggest that the "partitionning"
> should be done at a lower level, from within the CheckPointBuffers, which
> would take care of fsyncing files some time after writting buffers to them
> is finished.
>
>
> The idea to do a partial pass through shared buffers and only write a fraction
> of dirty buffers, then fsync them is a good one.
>
> The key point is that we spread out the fsyncs across the whole checkpoint
> period.
>
> I think we should be writing out all buffers for a particular file in one
> pass, then issue one fsync per file.  >1 fsyncs per file seems a bad idea.
>
> So we'd need logic like this
> 1. Run through shared buffers and analyze the files contained in there
> 2. Assign files to one of N batches so we can make N roughly equal sized
>    mini-checkpoints
> 3. Make N passes through shared buffers, writing out files assigned to
>    each batch as we go
>
> --
>
> Simon Riggs                http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

