Re: Design proposal: fsync absorb linear slider - Mailing list pgsql-hackers

From didier
Subject Re: Design proposal: fsync absorb linear slider
Date
Msg-id CAJRYxuJFBJQcCAd2-0kfWecY=3T5qqDWnMaZ2vf3_g0HAMXpJg@mail.gmail.com
In response to Re: Design proposal: fsync absorb linear slider  (Greg Smith <greg@2ndQuadrant.com>)
List pgsql-hackers
Hi,


On Fri, Jul 26, 2013 at 3:41 PM, Greg Smith <greg@2ndquadrant.com> wrote:
On 7/26/13 9:14 AM, didier wrote:
During recovery you have to load the log in cache first before applying WAL.

Checkpoints exist to bound recovery time after a crash.  That is their only purpose.  What you're suggesting moves a lot of work into the recovery path, which will slow down how long it takes to process.

Yes, it's slower, but you're sequentially reading only one file, at most the size of your buffer cache; moreover, it takes constant time.

Let's say you make a checkpoint and crash just after, with a next-to-empty WAL.

Now recovery is very fast, but you have to repopulate your cache with random reads driven by incoming requests.

With the snapshot it's slower, but you read, sequentially again, a lot of hot cache that you will need later when the db starts serving requests.
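As a rough illustration of the comparison above, here is a back-of-envelope sketch. The cache size, sequential throughput and random IOPS figures are assumptions picked for the example, not measurements, and the function names are hypothetical.

```python
# Back-of-envelope model: every number used here is an illustrative
# assumption, not a measurement taken from PostgreSQL.

def sequential_load_seconds(cache_bytes: int, seq_bytes_per_sec: float) -> float:
    """Time to read one snapshot file of the buffer cache sequentially."""
    return cache_bytes / seq_bytes_per_sec


def random_warmup_seconds(cache_bytes: int, page_bytes: int, random_iops: float) -> float:
    """Time to fault the same pages back in one by one with random reads."""
    pages = cache_bytes / page_bytes
    return pages / random_iops


if __name__ == "__main__":
    cache = 8 * 1024**3   # assumed 8 GiB buffer cache
    page = 8192           # 8 kB pages (PostgreSQL's default block size)
    seq = sequential_load_seconds(cache, 100 * 1024**2)  # assumed 100 MiB/s
    rnd = random_warmup_seconds(cache, page, 200)        # assumed 200 random IOPS
    print(f"sequential snapshot load: {seq:.0f} s")
    print(f"random-read warm-up:      {rnd:.0f} s")
```

The sequential figure depends only on the cache size, which is the constant-time point; the random figure is an upper bound that assumes every cached page is eventually requested again.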

Of course, the worst case is a crash just before a checkpoint: most of the snapshot data are stale and will be overwritten by WAL ops.

But if the WAL recovery is CPU bound, loading from the snapshot may be done concurrently with replaying the WAL.
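A minimal sketch of why that overlap matters, under the idealized assumption that the snapshot load is purely I/O bound, the WAL replay is purely CPU bound, and the two do not contend with each other. The function names are hypothetical, and the model ignores any ordering constraints between loaded pages and replayed records.

```python
def serial_recovery_seconds(snapshot_load_s: float, wal_replay_s: float) -> float:
    """Load the whole snapshot first, then replay the WAL."""
    return snapshot_load_s + wal_replay_s


def overlapped_recovery_seconds(snapshot_load_s: float, wal_replay_s: float) -> float:
    """Run both at once; recovery ends when the slower of the two ends."""
    return max(snapshot_load_s, wal_replay_s)


if __name__ == "__main__":
    load_s, replay_s = 80.0, 120.0  # assumed durations, for illustration only
    print("serial:    ", serial_recovery_seconds(load_s, replay_s))
    print("overlapped:", overlapped_recovery_seconds(load_s, replay_s))
```

In this model the snapshot adds nothing to recovery time whenever the WAL replay is the longer of the two phases.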

More work at recovery time means someone who uses the default of checkpoint_timeout='5 minutes', expecting that crash recovery won't take very long, will discover it does take a longer time now.  They'll be forced to shrink the value to get the same recovery time as they do currently.  You might need to make checkpoint_timeout 3 minutes instead, if crash recovery now has all this extra work to deal with.  And when the time between checkpoints drops, it will slow the fundamental efficiency of checkpoint processing down.  You will end up writing out more data in the end.
Yes, it's a trade-off: now you're paying the price at checkpoint time, every time; with the log you're paying only once, at recovery.

The interval between checkpoints and recovery time are all related.  If you let any one side of the current requirements slip, it makes the rest easier to deal with.  Those are all trade-offs though, not improvements.  And this particular one is already an option.

If you want less checkpoint I/O per capita and don't care about recovery time, you don't need a code change to get it.  Just make checkpoint_timeout huge.  A lot of checkpoint I/O issues go away if you only do a checkpoint per hour, because instead of random writes you're getting sequential ones to the WAL.  But when you crash, expect to be down for a significant chunk of an hour, as you go back to sort out all of the work postponed before.
It's not the same: it's a snapshot saved and loaded in constant time, unlike the WAL log.

Didier
