Re: Hot standby, recovery infra - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: Hot standby, recovery infra
Msg-id 1233830036.4500.422.camel@ebony.2ndQuadrant
In response to Re: Hot standby, recovery infra  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
List pgsql-hackers
On Thu, 2009-02-05 at 11:46 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:

> > So we might end up flushing more often *and* we will be doing it
> > potentially in the code path of other users.
> 
> For example, imagine a database that fits completely in shared buffers. 
> If we update at every XLogFileRead, we have to fsync every 16MB of WAL. 
> If we update in XLogFlush the way I described, you only need to update 
> when we flush a page from the buffer cache, which will only happen at 
> restartpoints. That's far fewer updates.
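
For concreteness, I'm reading the scheme you describe as roughly the
sketch below. The names (UpdateMinRecoveryPoint, replayEndRecPtr) are my
guesses following xlog.c conventions, so treat it as illustrative rather
than as your patch:

/*
 * Called from XLogFlush() during recovery, in place of a WAL flush:
 * advance the control file's consistency point only when a buffer with
 * a newer LSN is about to be written out.
 */
static void
UpdateMinRecoveryPoint(XLogRecPtr lsn)
{
    /* Fast path: pg_control already covers this LSN, nothing to do */
    if (XLByteLE(lsn, minRecoveryPoint))
        return;

    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);

    /* Recheck under the lock; someone may have advanced it meanwhile */
    if (XLByteLT(minRecoveryPoint, lsn))
    {
        /*
         * Advance all the way to the last record replayed, not just to
         * 'lsn', so that later flushes below that point skip the write.
         */
        minRecoveryPoint = replayEndRecPtr;
        ControlFile->minRecoveryPoint = minRecoveryPoint;
        UpdateControlFile();    /* writes and fsyncs pg_control */
    }

    LWLockRelease(ControlFileLock);
}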

Oh, did you change the bgwriter so it doesn't do normal page cleaning? 

General thoughts: the latest HS patch has a CPU profile within 1-2% of
the current code, and the use of ProcArrayLock is fairly minimal now.
The additional CPU comes from recoveryStopsHere(), which enables the
manual control of recovery, so the trade-off seems worth it. The major
CPU hog remains RecordIsValid(), i.e. the CRC checks. Startup is still
I/O bound. The specific avoidable I/O hogs are (1) checkpoints and
(2) page cleaning, so I hope we can avoid both of those.

I also hope to minimise the I/O cost of keeping track of the
consistency point. If that were done as infrequently as once per
restartpoint then I would certainly be very happy, but that won't
happen in the proposed scheme if we do page cleaning.
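
What I'd hoped for is more like the fragment below, with the
consistency point folded into the restartpoint itself, so pg_control
takes exactly one write-and-fsync per restartpoint and nothing on the
buffer-flush path (again a sketch with assumed names, not actual patch
contents):

/* In CreateRestartPoint(), sketched: the only pg_control update */
LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
ControlFile->checkPoint = lastCheckPointRecPtr;
ControlFile->checkPointCopy = lastCheckPoint;
ControlFile->minRecoveryPoint = replayEndRecPtr;  /* consistency point */
UpdateControlFile();
LWLockRelease(ControlFileLock);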

> Expanding that example to a database that doesn't fit in cache, you're 
> still replacing pages from the buffer cache that have been untouched for 
> longest. Such pages will have an old LSN, too, so we shouldn't need to 
> update very often.

They will tend to be written in ascending LSN order, which will mean we
continually update the control file; anything out of order does skip a
write. So the better the cache is at finding true LRU blocks, the more
writes we will make.
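
With invented numbers: flushes of buffers with LSNs 100, 250 and 400
each advance the consistency point and fsync pg_control; only a flush
at LSN 150 arriving after the 400 gets to skip the write.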

> I'm sure you can come up with an example of where we end up fsyncing 
> more often, but it doesn't seem like the common case to me.

I'm not trying to come up with counterexamples...

> > This change seems speculative and also against what has previously been
> > agreed with Tom. If he chooses not to comment on your changes, that's up
> > to him, but I don't think you should remove things quietly that have
> > been put there through the community process, as if they caused
> > problems. I feel like I'm in the middle here. 
> 
> I'd like to have the extra protection that this approach gives. If we 
> let safeStartPoint be ahead of the actual WAL we've replayed, we have 
> to just assume we're fine if we reach end of WAL before reaching that 
> point. That assumption falls down if e.g. recovery is stopped, and you go 
> and remove the last few WAL segments from the archive before restarting 
> it, or signal pg_standby to trigger failover too early. Tracking the 
> real safe starting point and enforcing it always protects you from that.

Doing it this way will require you to remove the existing specific
error messages about recovery ending before the end time of the backup,
replacing them with more general ones that just say "consistency not
reached", which makes it harder for the user to work out what to do
about it.
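
In outline (the flag name is invented and this is not the current code,
just the shape of the problem):

/* End-of-WAL check in StartupXLOG(), sketched */
if (InArchiveRecovery && XLByteLT(EndOfLog, ControlFile->minRecoveryPoint))
{
    if (stoppedBeforeBackupEnd)     /* invented flag */
        /* today: specific, actionable message */
        ereport(FATAL,
                (errmsg("WAL ends before end time of online backup")));
    else
        /* proposed: the generic message is all that remains */
        ereport(FATAL,
                (errmsg("WAL ends before reaching a consistent recovery point")));
}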

> (we did discuss this a week ago: 
> http://archives.postgresql.org/message-id/4981F6E0.2040503@enterprisedb.com)

Yes, we need to update it in that case. Though that is in no way
agreement to the other changes, nor does it require them.

-- 
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support


