On Thu, 2008-09-11 at 09:24 +0300, Heikki Linnakangas wrote:
> I like the idea of acquiring snapshots locally in the slave much more.
> As you mentioned, the options there are to defer applying WAL, or cancel
> queries.
More exotic ways to defer applying WAL include using smart filesystems to
get per-backend data snapshots, via either copy-on-write overlay
filesystems or filesystem- or disk-level snapshots.
At least disk-level snapshots already exist in SANs, aimed at easing
backups, though I'm not sure how effective they would be for the hot
standby use intended here.
Using any of those requires detecting when shared buffers hold "too new"
data pages, bypassing them, and reading those pages directly from the disk
snapshot.
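
To make that concrete, here is a minimal C sketch of such a bypass path.
Everything in it is made up for illustration (SnapshotReadState,
read_page_for_snapshot, the simplified page_lsn accessor); nothing like
this exists in the current code:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192

typedef uint64_t Lsn;               /* WAL position, simplified to 64 bits */

typedef struct SnapshotReadState
{
    Lsn snapshot_lsn;   /* WAL position at which the disk snapshot was taken */
    int snap_fd;        /* fd of the snapshot device / overlay file */
} SnapshotReadState;

/* Hypothetical accessor: pretend the page LSN sits at the start of the page. */
static Lsn
page_lsn(const char *page)
{
    Lsn lsn;

    memcpy(&lsn, page, sizeof lsn);
    return lsn;
}

/*
 * Read one block for a backend pinned to a disk snapshot.  If the copy in
 * shared buffers has been modified past the snapshot's LSN ("too new"),
 * bypass it and read the frozen copy straight from the snapshot device.
 */
static void
read_page_for_snapshot(const SnapshotReadState *state, uint32_t blkno,
                       const char *buffered,  /* NULL if not in shared buffers */
                       char *out)
{
    if (buffered != NULL && page_lsn(buffered) <= state->snapshot_lsn)
        memcpy(out, buffered, BLCKSZ);      /* buffered copy is old enough */
    else
    {
        off_t offset = (off_t) blkno * BLCKSZ;

        if (pread(state->snap_fd, out, BLCKSZ, offset) != (ssize_t) BLCKSZ)
        {
            perror("pread from disk snapshot");
            exit(1);
        }
    }
}

The point is just that each backend remembers the WAL position of its
snapshot and treats any buffered page newer than that as invisible,
falling back to the frozen on-disk copy.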
> I think both options need the same ability to detect when
> you're about to remove a tuple that's still visible to some snapshot,
> just the action is different. We should probably provide a GUC to
> control which you want.
We probably need two LSNs per page to make maximal use of our MVCC in a
Hot Standby situation, so we can distinguish additions to a page, which
imply no data loss, from row removals, which do. Currently only VACUUM and
HOT pruning can cause row removal.
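
Roughly, with hypothetical names (pd_prune_lsn,
page_conflicts_with_snapshot) and LSNs simplified to plain 64-bit
integers:

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;       /* WAL position, simplified */

/*
 * Hypothetical page header carrying two LSNs instead of one: pd_lsn is the
 * usual "last WAL record that touched this page", while pd_prune_lsn would
 * track only changes that remove rows (VACUUM, HOT pruning).  Plain inserts
 * advance pd_lsn but leave pd_prune_lsn alone.
 */
typedef struct TwoLsnPageHeader
{
    Lsn pd_lsn;             /* LSN of the last change of any kind */
    Lsn pd_prune_lsn;       /* LSN of the last row removal only */
} TwoLsnPageHeader;

/*
 * A standby query whose snapshot was taken at snapshot_lsn only conflicts
 * with this page if rows were removed after that point; a page that merely
 * gained new tuples since then is still safe to read.
 */
static bool
page_conflicts_with_snapshot(const TwoLsnPageHeader *hdr, Lsn snapshot_lsn)
{
    return hdr->pd_prune_lsn > snapshot_lsn;
}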
> However, if we still to provide the behavior that "as long as the
> network connection works, the master will not remove tuples still needed
> in the slave" as an option, a lot simpler implementation is to
> periodically send the slave's oldest xmin to master. Master can take
> that into account when calculating its own oldest xmin. That requires a
> lot less communication than the proposed scheme to send snapshots back
> and forth. A softer version of that is also possible, where the master
> obeys the slave's oldest xmin, but only up to a point.
That point could be statement_timeout, or a (currently missing)
transaction_timeout.
Also, the decision to advance xmin should probably be sent to the slave as
well, even though it is not something that is needed in the master's local
WAL.
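
A sketch of what that "softer" rule could look like on the master, with
every identifier (SlaveFeedback, max_slave_xmin_delay,
effective_oldest_xmin) made up and xid wraparound ignored for brevity:

#include <stdint.h>
#include <time.h>

typedef uint32_t TransactionId;

/* Hypothetical feedback record received periodically from the slave. */
typedef struct SlaveFeedback
{
    TransactionId oldest_xmin;  /* oldest xmin still needed on the slave */
    time_t        received_at;  /* when the master last heard from it */
} SlaveFeedback;

/*
 * The master holds its oldest xmin back to the slave's value, but only
 * while the feedback is fresh enough; max_slave_xmin_delay (seconds) plays
 * the role suggested above for statement_timeout / transaction_timeout.
 */
static TransactionId
effective_oldest_xmin(TransactionId local_oldest_xmin,
                      const SlaveFeedback *fb, int max_slave_xmin_delay)
{
    if (fb != NULL &&
        time(NULL) - fb->received_at <= max_slave_xmin_delay &&
        fb->oldest_xmin < local_oldest_xmin)    /* ignoring xid wraparound */
        return fb->oldest_xmin;

    return local_oldest_xmin;
}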
--------------
Hannu