Re: Load Distributed Checkpoints, final patch - Mailing list pgsql-patches
From: Heikki Linnakangas
Subject: Re: Load Distributed Checkpoints, final patch
Msg-id: 468A0B5E.9060304@enterprisedb.com
In response to: Re: Load Distributed Checkpoints, final patch (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-patches
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
>> Heikki Linnakangas wrote:
>>> For comparison, imola-328 has full_page_writes=off. Checkpoints last ~9
>>> minutes there, and the graphs look very smooth. That suggests that
>>> spreading the writes over a longer time wouldn't make a difference, but
>>> smoothing the rush at the beginning of checkpoint might. I'm going to
>>> try the algorithm I posted, that uses the WAL consumption rate from
>>> previous checkpoint interval in the calculations.
>
>> One thing that concerns me is that checkpoint smoothing happening just
>> after the checkpoint is causing I/O at the same time that
>> full_page_writes is causing additional I/O.
>
> I'm tempted to just apply some sort of nonlinear correction to the
> WAL-based progress measurement. Squaring it would be cheap but is
> probably too extreme. Carrying over info from the previous cycle
> doesn't seem like it would help much; rather, the point is exactly
> that we *don't* want a constant write speed during the checkpoint.

While thinking about this, I made an observation on full_page_writes. Currently, we perform a full page write whenever LSN < RedoRecPtr. If we're clever, we can skip or defer some of the full page writes.

The rule is that when we replay, we always need to replay a full page image before we apply any regular WAL records to the page. When we begin a checkpoint, there are two possible outcomes: we crash before the new checkpoint is finished, and we replay starting from the previous redo ptr; or we finish the checkpoint successfully, and we replay starting from the new redo ptr (or we don't crash and don't need to recover at all).

To be able to recover from the previous redo ptr, we don't need to write a full page image if we have already written one since the previous redo ptr. To be able to recover from the new redo ptr, we don't need to write a full page image if we haven't flushed the page yet: it will be written and fsync'd by the time the checkpoint finishes.

IOW, we can skip full page images of pages that we have already taken a full page image of since the previous checkpoint, and that we haven't yet flushed during the current checkpoint.

This might reduce the overall WAL I/O a little, but more importantly, it spreads the impact of taking full page images over the checkpoint duration. That's a good thing on its own, but it also makes it unnecessary to compensate for the full_page_writes rush in the checkpoint smoothing.

I'm still trying to get my head around the bookkeeping required to get this right. I think it's possible using the new BM_CHECKPOINT_NEEDED flag and a new flag in the page header to mark pages where we skipped taking the full page image when the page was last modified. For 8.3, we should probably just do some simple compensation in the checkpoint throttling code, if we want to do anything at all. But this is something to think about in the future.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com