Re: Summary and Plan for Hot Standby - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: Summary and Plan for Hot Standby
Msg-id 1258295578.14054.1432.camel@ebony
In response to Re: Summary and Plan for Hot Standby  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
List pgsql-hackers
On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > There are two remaining areas of significant thought/effort:
> 
> Here's a list of other TODO items I've collected so far. Some of them
> are just improvements or nice-to-have stuff, but some are more serious:
> 
> - If WAL recovery runs out of lock space while acquiring an
> AccessExclusiveLock on behalf of a transaction that ran in the master,
> it will FATAL and abort recovery, bringing down the standby. Seems like
> it should wait/cancel queries instead.

Hard resource limits will always be an issue. If the standby has less
than it needs, there will be problems. All of those can be corrected by
increasing the resources on the standby and restarting. This affects
max_connections, max_prepared_transactions and max_locks_per_transaction,
as documented.
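As a sketch (the values below are illustrative, not recommendations):
each of these settings on the standby has to be at least as large as on
the master, or recovery can run out of lock or transaction slots and
FATAL out.

```
# postgresql.conf on the standby -- match or exceed the master's values
max_connections = 100
max_prepared_transactions = 10
max_locks_per_transaction = 64
```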

> - When switching from standby mode to normal operation, we momentarily
> hold all AccessExclusiveLocks held by prepared xacts twice, needing
> twice the lock space. You can run out of lock space at that point,
> causing failover to fail.

That was the issue I mentioned.

> - When replaying b-tree deletions, we currently wait out/cancel all
> running (read-only) transactions. We take the ultra-conservative stance
> because we don't know how recent the tuples being deleted are. If we
> could store a better estimate for latestRemovedXid in the WAL record, we
> could make that less conservative.

Exactly my point. There are already parts of the patch that may cause
usage problems and need further thought. The earlier we get this into
people's hands, the earlier we will find out what those problems are and
can begin doing something about them.

> - The assumption that b-tree vacuum records don't need conflict
> resolution because we did that with the additional cleanup-info record
> works ATM, but it hinges on the fact that we don't delete any tuples
> marked as killed while we do the vacuum. That seems like a low-hanging
> fruit that I'd actually like to do now that I spotted it, but will then
> need to fix b-tree vacuum records accordingly. We'd probably need to do
> something about the previous item first to keep performance acceptable.
> 
> - There's the optimization to replay of b-tree vacuum records that we
> discussed earlier: Replay has to touch all leaf pages because of the
> interlock between heap scans, to ensure that we don't vacuum away a heap
> tuple that a concurrent index scan is about to visit. Instead of
> actually reading in and pinning all pages, during replay we could just
> check that the pages that don't need any other work to be done are not
> currently pinned in the buffer cache.

Yes, it's an optimization. Not one I consider critical, but a cool and
interesting one.
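For illustration only, the proposed optimization could be sketched like
this (the function and argument names are mine, not the actual
PostgreSQL code): pages with no dead tuples need not be read from disk
at all; it is enough to verify that no backend currently holds a pin on
them in the buffer cache.

```python
def replay_btree_vacuum(leaf_pages, pages_with_dead_tuples, buffer_pins):
    """Return the list of pages actually read and cleaned during replay.

    buffer_pins maps page number -> current pin count for pages that are
    resident in the buffer cache (a stand-in for the real buffer manager).
    """
    cleaned = []
    for page in leaf_pages:
        if page in pages_with_dead_tuples:
            # Must read, pin and clean this page regardless.
            cleaned.append(page)
        elif buffer_pins.get(page, 0) > 0:
            # Page is resident and pinned: a concurrent scan may be
            # mid-flight, so replay has to wait (or resolve the conflict).
            raise RuntimeError("page %d is pinned; wait before continuing" % page)
        # else: not in cache, or unpinned -- skip the physical read entirely
    return cleaned
```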

> - Do we do the b-tree page pinning explained in previous point correctly
> at the end of index vacuum? ISTM we're not visiting any pages after the
> last page that had dead tuples on it.

Looks like a new bug, not previously mentioned.

> - code structure. I moved much of the added code to a new standby.c
> module that now takes care of replaying standby related WAL records. But
> there's code elsewhere too. I'm not sure if this is a good division but
> seems better than the original ad hoc arrangement where e.g lock-related
> WAL handling was in inval.c

> - The "standby delay" is measured as current timestamp - timestamp of
> last replayed commit record. If there's little activity in the master,
> that can lead to surprising results. For example, imagine that
> max_standby_delay is set to 8 hours. The standby is fully up-to-date
> with the master, and there's no write activity in master.  After 10
> hours, a long reporting query is started in the standby. Ten minutes
> later, a small transaction is executed in the master that conflicts with
> the reporting query. I would expect the reporting query to be canceled 8
> hours after the conflicting transaction began, but it is in fact
> canceled immediately, because it's over 8 hours since the last commit
> record was replayed.

An issue that will be easily fixable with streaming replication, since
the standby effectively needs a heartbeat to listen to. Adding a regular
stream of WAL records is also possible, but there is no need unless
streaming is somehow in doubt. Again, there is work to do once both are
in.
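The surprising behaviour Heikki describes follows directly from the
arithmetic. A purely illustrative sketch (the real computation lives in
the recovery code):

```python
from datetime import datetime, timedelta

def should_cancel(now, last_replayed_commit, max_standby_delay):
    # The delay is measured against the last *replayed commit record*,
    # not against when the conflicting WAL record was generated, so an
    # idle master makes the standby look further behind than it is.
    return now - last_replayed_commit > max_standby_delay

# Master idle for 10 hours, max_standby_delay = 8 hours: a conflict
# arriving now cancels the reporting query immediately.
idle_since = datetime(2009, 11, 15, 0, 0)
conflict_at = idle_since + timedelta(hours=10)
assert should_cancel(conflict_at, idle_since, timedelta(hours=8))
```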

> - ResolveRecoveryConflictWithVirtualXIDs polls until the victim
> transactions have ended. It would be much nicer to sleep. We'd need a
> version of LockAcquire with a timeout. Hmm, IIRC someone submitted a
> patch for lock timeouts recently. Maybe we could borrow code from that?

Nice? 
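For illustration, the difference between the current polling approach
and a timeout-capable wait could be sketched like this (Python, names
hypothetical; threading.Event stands in for the lock wakeup):

```python
import threading

victim_done = threading.Event()

def resolve_conflict_polling(timeout_s, interval_s=0.01):
    # Current approach: wake up repeatedly to re-check whether the
    # victim transactions have ended (many wasted wakeups).
    waited = 0.0
    while waited < timeout_s:
        if victim_done.is_set():
            return True
        victim_done.wait(interval_s)   # stand-in for a short pg_usleep()
        waited += interval_s
    return victim_done.is_set()

def resolve_conflict_timed_wait(timeout_s):
    # With a LockAcquire-style timeout, the waiter sleeps once and is
    # woken either by the lock release or by the deadline expiring.
    return victim_done.wait(timeout=timeout_s)
```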

-- Simon Riggs           www.2ndQuadrant.com


