Thread: Transaction Snapshots and Hot Standby

Transaction Snapshots and Hot Standby

From
Simon Riggs
Date:
Transaction snapshots is probably the most difficult problem for Hot
Standby to resolve. We *must* have a transaction snapshot to access
table data in a consistent way, or we must accept some inconsistency,
though that is not simple either.

We can derive transaction snapshots
     * remotely from primary node
     * locally on the standby node

If we derive a snapshot locally, then we will end up with a situation
where the xmin of the local snapshot precedes the xmin of the primary
node. When this occurs it will then be possible for WAL records to
arrive on the standby that request removal of rows that a transaction
might wish to see. Preventing that situation can be done by either
deferring WAL apply or by cancelling queries. 

We can defer WAL apply for particular tables only, but this could
significantly complicate the apply process and is not a suggested option
for the first release of this feature.

We might control query cancellation by tracking which tables have had
rows removed that would have been visible to particular queries. Again,
possible but suffers from the problem that tables on which HOT is
frequently active would be almost unusable. So not a general solution.

Simply ignoring WAL removal has been discussed and rejected.
http://archives.postgresql.org/pgsql-hackers/2008-05/msg00753.php
Explicitly defining the tables a transaction wishes to see has also been
discussed and rejected.
http://archives.postgresql.org/pgsql-hackers/2008-08/msg00268.php

So the most generally applicable and simplest initial solution for
generating snapshots is to take them from the remote primary node. The
proposal for this follows:

Transaction snapshots will be managed by a new process, Snapproc.
Snapproc will start only in recovery mode and will exit when normal
processing mode begins. The snapshot process will connect to the remote
primary node, export snapshot data and copy it to shared memory on
the standby node. When all standby backends have run
UnregisterSnapshot(), the snapshot will be unregistered on the
remote primary node as well.

The standby must not think a transaction is visible until all changes
made by it have been applied. So snapshots from the primary cannot
be used on the standby until the LSN at which they were taken has been
reached by WAL apply on the standby. Snapshots don't normally have an
LSN, so we must derive that information as well when we take a Snapshot.
With asynchronous commits a transaction can be committed yet not yet
written to disk, so we cannot rely on the current WAL write pointer.
Taking the WAL insert pointer is the safest thing to do, though most
heavily contended. We don't want to hold ProcArrayLock while requesting
WALInsertLock, so we will get the LSN of the WAL insert pointer *after*
the Snapshot has been derived (it doesn't matter that much, as long as
it's not before the snapshot).
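
As a rough illustration of that ordering (simplified stand-in types and
placeholder helper names, not the real ProcArray/XLOG interfaces):

    #include <stdint.h>

    typedef uint64_t XLogRecPtr;          /* simplified stand-in */
    typedef struct Snapshot Snapshot;     /* opaque here */

    /* Placeholders for the real calls; the names are assumptions. */
    extern Snapshot  *DeriveSnapshotUnderProcArrayLock(void);
    extern XLogRecPtr ReadCurrentWALInsertPointer(void);

    typedef struct ExportedSnapshot
    {
        Snapshot   *snap;
        XLogRecPtr  usable_lsn;   /* standby may use snap only once replay >= this */
    } ExportedSnapshot;

    static ExportedSnapshot
    export_snapshot_for_standby(void)
    {
        ExportedSnapshot result;

        /* 1. Derive the snapshot first, holding only ProcArrayLock. */
        result.snap = DeriveSnapshotUnderProcArrayLock();

        /*
         * 2. Read the WAL insert pointer afterwards.  It can only be at or
         * beyond the point the snapshot refers to, so waiting for it on the
         * standby is conservative (never before the snapshot).
         */
        result.usable_lsn = ReadCurrentWALInsertPointer();

        return result;
    }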

So snapshots from the primary take time before they can be used. The
delay is exactly the current processing delay from primary to standby,
which is another reason why we do not wish to fall behind.

Taking snapshots from primary has a few disadvantages
     * snapshots take time before they are usable
     * requesting new snapshots is via remote request
     * snapshots on primary prevent row removal (but this was also an
       advantage of this technique!)

If primary and standby nodes are connected by private ethernet then the
snapshot request time will be ~10ms, though that is probably 100 times
slower than current snapshot access. If primary and standby are opposite
sides of an ocean then times could be as high as 500ms. Taking snapshots
directly could be annoyingly slow for small read-only statements, so we
need to buffer this process in various ways. For larger queries, this
may not be a problem at all, but we do not wish to limit or assume the
uses of the standby node. 

First of all, each backend will have the opportunity to reuse previous
snapshots both within transactions and across them. A userset parameter
snapshot_reuse_window = 0..60000ms will define the time window in which
any request for a new snapshot will simply result in being fed the last
snapshot again. When the window on a snapshot has expired a newer
snapshot will be presented instead. This idea is similar to serializable
transactions, which continually reuse the same snapshot. This is a
useful parameter for normal processing as well, since it will reduce
contention on the ProcArrayLock for many applications. 
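
A minimal sketch of the reuse logic a backend might run (the GUC name is
from the proposal above; everything else here is an illustrative
assumption):

    #include <time.h>

    static int snapshot_reuse_window_ms = 5000;    /* proposed userset GUC */

    typedef struct CachedSnapshot          /* caller zero-initializes this */
    {
        void            *snap;     /* opaque snapshot data, NULL if none yet */
        struct timespec  taken;    /* when it was obtained */
    } CachedSnapshot;

    extern void *request_new_snapshot(void);       /* placeholder */

    static void *
    get_snapshot(CachedSnapshot *cs)
    {
        struct timespec now;
        long            elapsed_ms;

        clock_gettime(CLOCK_MONOTONIC, &now);
        elapsed_ms = (now.tv_sec - cs->taken.tv_sec) * 1000
                   + (now.tv_nsec - cs->taken.tv_nsec) / 1000000;

        /* Inside the reuse window, simply hand back the previous snapshot. */
        if (cs->snap != NULL && elapsed_ms < snapshot_reuse_window_ms)
            return cs->snap;

        cs->snap = request_new_snapshot();
        clock_gettime(CLOCK_MONOTONIC, &cs->taken);
        return cs->snap;
    }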

Snapshots can be reused across transactions in recovery mode, since they
are held in shared memory. The amount of shared memory dedicated to storing
snapshots will be max_connections * the maximum size of a snapshot.

Since there is a delay between obtaining a new snapshot and it becoming
usable, the Snapshot process will buffer snapshots until they become
"mature", like a good Whiskey. The Snapshot process will take regular
snapshots and pre-age them so that when a backend requests a snapshot it
will be given the most recently matured snapshot. The time between
snapshots is set by snapshot_preread_timeout = 0..60000ms. If a request
for a snapshot arrives and there are no snapshots waiting to mature then
this will trigger the snapshot process to request a new primary snapshot.
(This parameter could be automatically set based upon the arrival rate of
snapshot requests, but that is something to consider later).
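
Roughly, the Snapshot process loop could look something like this (all
helper names are placeholders for illustration, not actual code):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t XLogRecPtr;                       /* simplified */

    static int snapshot_preread_timeout_ms = 5000;     /* proposed GUC */

    /* Placeholder helpers; names are assumptions. */
    extern bool       pending_snapshots_empty(void);
    extern bool       backend_waiting_for_snapshot(void);
    extern void       request_and_queue_primary_snapshot(void);  /* records its LSN */
    extern XLogRecPtr oldest_pending_snapshot_lsn(void);
    extern XLogRecPtr last_replayed_lsn(void);
    extern void       publish_oldest_pending_snapshot(void);
    extern void       wait_for_request_or_timeout(int timeout_ms);

    static void
    snapshot_process_loop(void)
    {
        for (;;)
        {
            /* Keep at least one snapshot ageing in the cellar. */
            if (pending_snapshots_empty() || backend_waiting_for_snapshot())
                request_and_queue_primary_snapshot();

            /* A pending snapshot matures once WAL replay passes its LSN. */
            while (!pending_snapshots_empty() &&
                   oldest_pending_snapshot_lsn() <= last_replayed_lsn())
                publish_oldest_pending_snapshot();

            wait_for_request_or_timeout(snapshot_preread_timeout_ms);
        }
    }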

If snapshot_reuse_window = 0 then a backend will be presented with a
freshly obtained snapshot and will then wait until the exact first
moment it can be used before returning.

We can continue to reuse snapshots from the primary even if the primary
crashes, becomes disconnected or is shutdown/restarted. New snapshots
are obviously not possible until it appears again. It's not that common
for us to lose contact with the primary *and* for it to *not* be a
failover, so this seems like an acceptable restriction.

Other alternatives? Maybe.

In earlier discussions, I suggested that we might use "Read Uncommitted"
mode for use with Hot Standby. Tom pointed out that what I had suggested
was not "Read Uncommitted" as described by SQL Standard.

For Hot Standby, I've had a look at the various transaction isolation
modes possible to see which, if any, are desirable:

1. The first is to just accept that transaction *isolation* is not
possible, but carry on anyway. We might call this READ INCONSISTENT -
which carries the problems noted by Tom earlier. In this mode we can
"see" committed deletes because their row versions might be missing from
our query, updated records can disappear from queries, and yet at the
same time newly inserted data can be invisible. However,
if the table is insert only or read only this will give a consistent
result, so this mode *is* appealing to many potential users. (It's been
suggested to me twice, independently.) We would need to derive a
snapshot based upon which transactions are "running" according to the
replay of transactions during recovery.

2. The second is to throw an error if transaction isolation is violated.
Currently we can only do this approximately, by recording the global LSN
at start of query and then aborting the query if we touch a data block
that has changes made after the LSN. A slightly more extreme version of
(1), but always consistent if it completes. (Somebody has joked we could
make the error message "snapshot too old").

We can implement both of those with an additional parameter,
allow_xmin_advance = off (default) | on
Mode (1) is allow_xmin_advance = on in READ COMMITTED mode
Mode (2) is allow_xmin_advance = on in SERIALIZABLE mode

3. We can also implement Read Uncommitted mode itself in the database.
We would get inconsistent results for updates and deletes *and* inserts
but at least we would always see the latest version of an updated row,
rather than skipping them completely. On a read only table this might be
very useful. On an insert only table this might result in some long
running queries as a scan struggles to keep up with inserts!

For (3) we would need to retrieve tuples from executor scan nodes in
SnapshotDirty mode. If we see a tuple we would follow its tuple chain to
another row on the same block using EvalPlanQual() logic. If we are
using an IndexScan and the update chain goes off-block then there will
be another index pointer that we (may) see to access the tuple, so we
would stop at that point. If we are using a SeqScan we should follow the
tuple chain until we see the top tuple, even if it goes off-block.
Bitmap scans would cause problems.

Deriving snapshots from the master seems the most obvious, but let's see
where the dice roll...

Other topics on Hot Standby will be covered later; this is just the
first part.

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Transaction Snapshots and Hot Standby

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> Taking snapshots from primary has a few disadvantages
> 
>  ...
>       * snapshots on primary prevent row removal (but this was also an
>         advantage of this technique!)

That makes it an awful solution for high availability. A backend hung in 
transaction-in-progress state in the slave will prevent row removal on 
the master. Isolating the master from queries performed in the
slave is exactly the reason why people use hot standby. And running long 
reporting queries in the standby is again a very typical use case.

And still we can't escape the scenario that the slave receives a WAL 
record that vacuums away a tuple that's still visible according to a 
snapshot used in the slave. Even with the proposed scheme, this can happen:

1. Slave receives a snapshot from master
2. A long-running transaction begins on the slave, using that snapshot
3. Network connection is lost
4. Master hits a timeout, and decides to discard the snapshot it sent to 
the slave
5. A tuple visible to the snapshot is vacuumed
6. Network connection is re-established
7. Slave receives the vacuum WAL record, even though the long-running 
transaction still needs the tuple.

I like the idea of acquiring snapshots locally in the slave much more. 
As you mentioned, the options there are to defer applying WAL, or cancel 
queries. I think both options need the same ability to detect when 
you're about to remove a tuple that's still visible to some snapshot, 
just the action is different. We should probably provide a GUC to 
control which you want.

However, if we still want to provide the behavior that "as long as the 
network connection works, the master will not remove tuples still needed 
in the slave" as an option, a lot simpler implementation is to 
periodically send the slave's oldest xmin to master. Master can take 
that into account when calculating its own oldest xmin. That requires a 
lot less communication than the proposed scheme to send snapshots back 
and forth. A softer version of that is also possible, where the master 
obeys the slave's oldest xmin, but only up to a point.
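
A sketch of what that "softer version" might look like on the master,
with an entirely hypothetical cap parameter (simplified XID arithmetic,
ignoring wraparound):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;

    /*
     * Hypothetical limit on how far the slave's xmin may hold back the
     * master's cleanup horizon; 0 means "always obey the slave".
     */
    static uint32_t max_slave_xmin_lag = 1000000;

    static TransactionId
    effective_oldest_xmin(TransactionId master_oldest_xmin,
                          TransactionId slave_oldest_xmin,
                          bool slave_connected)
    {
        TransactionId result = master_oldest_xmin;

        if (slave_connected && slave_oldest_xmin < result)
        {
            result = slave_oldest_xmin;

            /* Softer variant: never hold back cleanup more than the cap. */
            if (max_slave_xmin_lag != 0 &&
                master_oldest_xmin - result > max_slave_xmin_lag)
                result = master_oldest_xmin - max_slave_xmin_lag;
        }

        return result;
    }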

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Transaction Snapshots and Hot Standby

From
Hannu Krosing
Date:
On Thu, 2008-09-11 at 09:24 +0300, Heikki Linnakangas wrote:

> I like the idea of acquiring snapshots locally in the slave much more. 
> As you mentioned, the options there are to defer applying WAL, or cancel 
> queries. 

More exotic ways to defer applying WAL include using some smart
filesystems to get per-backend data snapshots, using either
copy-on-write overlay filesystems or filesystem- or disk-level
snapshots.

At least disk-level snapshots exist in SANs with the aim of easing
backups, though I'm not sure how effective they would be for the
intended hot standby use.

Using any of those requires detecting and bypassing shared buffers if they
hold "too new" data pages, and reading those pages directly from the disk
snapshot.

> I think both options need the same ability to detect when 
> you're about to remove a tuple that's still visible to some snapshot, 
> just the action is different. We should probably provide a GUC to 
> control which you want.

We probably need to have two LSNs per page to make maximal use of our
MVCC in the Hot Standby situation, so we can distinguish additions to a
page, which imply no data loss, from row removal, which does. Currently
only Vacuum and HOT pruning can cause row removal.

> However, if we still want to provide the behavior that "as long as the 
> network connection works, the master will not remove tuples still needed 
> in the slave" as an option, a lot simpler implementation is to 
> periodically send the slave's oldest xmin to master. Master can take 
> that into account when calculating its own oldest xmin. That requires a 
> lot less communication than the proposed scheme to send snapshots back 
> and forth. A softer version of that is also possible, where the master 
> obeys the slave's oldest xmin, but only up to a point.

That point could be statement_timeout or (currently missing)
transaction_timeout

Also, the decision to advance xmin should probably be sent to the slave as
well, even though it is not something that is needed in local WAL logs.

--------------
Hannu




Re: Transaction Snapshots and Hot Standby

From
Simon Riggs
Date:
On Thu, 2008-09-11 at 09:24 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > Taking snapshots from primary has a few disadvantages
> > 
> >  ...
> >       * snapshots on primary prevent row removal (but this was also an
> >         advantage of this technique!)
> 
> That makes it an awful solution for high availability. 

Please be careful about making such statements. People might think you
were opposing the whole idea of Hot Standby, rather than giving an
opinion about one suggestion out of many implementation proposals.

Looks like you've got some good additional suggestions later in the
post. I'll reply later to those, so thanks for that.

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Transaction Snapshots and Hot Standby

From
Richard Huxton
Date:
Heikki Linnakangas wrote:
> Simon Riggs wrote:
>> Taking snapshots from primary has a few disadvantages
>>
>>  ...
>>       * snapshots on primary prevent row removal (but this was also an
>>         advantage of this technique!)
> 
> That makes it an awful solution for high availability. A backend hung in
> transaction-in-progress state in the slave will prevent row removal on
> the master. Isolating the master from queries performed in the
> slave is exactly the reason why people use hot standby. And running long
> reporting queries in the standby is again a very typical use case.

I have to say I agree with Heikki here. Blocking the master based on
what the slave is doing seems to make hot standby less useful than warm.

> I like the idea of acquiring snapshots locally in the slave much more.

It's the option that I can see people (well, me) understanding the
easiest. All the others sound like ways to get things wrong. As for
inconsistent query-results - that way madness lies. How on earth will
anyone be able to diagnose or report bugs when they occur?

> As you mentioned, the options there are to defer applying WAL, or cancel
> queries. I think both options need the same ability to detect when
> you're about to remove a tuple that's still visible to some snapshot,
> just the action is different. We should probably provide a GUC to
> control which you want.

I think there's only one value here: "hot standby wal delay time before
cancelling query". Might be a shorter name.

--  Richard Huxton Archonet Ltd


Re: Transaction Snapshots and Hot Standby

From
Simon Riggs
Date:
On Thu, 2008-09-11 at 11:11 +0100, Richard Huxton wrote:

> I have to say I agree with Heikki here. Blocking the master based on
> what the slave is doing seems to make hot standby less useful than warm.

I agree also, that's why I flagged it up for discussion.

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Transaction Snapshots and Hot Standby

From
Gregory Stark
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:

> On Thu, 2008-09-11 at 11:11 +0100, Richard Huxton wrote:
>
>> I have to say I agree with Heikki here. Blocking the master based on
>> what the slave is doing seems to make hot standby less useful than warm.
>
> I agree also, that's why I flagged it up for discussion.

So as far as I can see there are basically two option here. Either

a) transactions with live snapshots on the slave prevent the master from being
   able to vacuum tuples (which defeats the purpose of having a live standby
   server for some users).

or

b) vacuum on the server which cleans up a tuple the slave has in scope has to
   block WAL replay on the slave (which I suppose defeats the purpose of having
   a live standby for users concerned more with fail-over latency).

Is there any middle ground or brilliant ways to get the best of both worlds?

If not it seems to me the latter is preferable since at least the consequence
of having a long-running query on the slave occurs on the same machine running
the query. And the work-around -- killing the long-running query -- requires
taking action on the same machine as the consequences. Also, when you take
action it fixes the problem immediately as WAL replay can recommence which
seems like a better deal than a bloated database.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!


Re: Transaction Snapshots and Hot Standby

From
"Merlin Moncure"
Date:
On Thu, Sep 11, 2008 at 7:18 AM, Gregory Stark <stark@enterprisedb.com> wrote:
> a) transactions with live snapshots on the slave prevent the master from being
>   able to vacuum tuples (which defeats the purpose of having a live standby
>   server for some users).
>
> or
>
> b) vacuum on the server which cleans up a tuple the slave has in scope has to
>   block WAL replay on the slave (which I suppose defeats the purpose of having
>   a live standby for users concerned more with fail-over latency).

There is nothing stopping you from setting up two (or more) slave
servers, with one designated as failover that doesn't serve queries,
right?  Option b seems pretty reasonable to me, although I'd prefer to
block wal replay vs canceling queries...although it might be nice to
manually be able to force wal replay 'with query cancel' via a
checkpoint.

merlin


Re: Transaction Snapshots and Hot Standby

From
Heikki Linnakangas
Date:
Merlin Moncure wrote:
> There is nothing stopping you from setting up two (or more) slave
> servers, with one designated as failover that doesn't serve queries,
> right?

I'd imagine that even if applying the WAL on the slave is blocked, it's 
still streamed from the master to the slave, and in case of failover the 
slave will fast-forward before starting up as the new master. Of course, 
if it has fallen 3 days behind because of a giant reporting query, it 
can take a while to replay all the WAL that has accumulated.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Transaction Snapshots and Hot Standby

From
Csaba Nagy
Date:
On Thu, 2008-09-11 at 15:23 +0300, Heikki Linnakangas wrote:
> I'd imagine that even if applying the WAL on the slave is blocked, it's 
> still streamed from the master to the slave, and in case of failover the 
> slave will fast-forward before starting up as the new master.

Which begs the question: what happens with a query which is running on
the slave at the moment when the slave switches from recovery mode and
starts up? Should the running queries be canceled if they are blocking
applying of WAL, to allow start-up, or let them finish?

Cheers,
Csaba.




Re: Transaction Snapshots and Hot Standby

From
Heikki Linnakangas
Date:
Gregory Stark wrote:
> b) vacuum on the server which cleans up a tuple the slave has in scope has to
>    block WAL replay on the slave (which I suppose defeats the purpose of having
>    a live standby for users concerned more with fail-over latency).

One problem with this, BTW, is that if there's a continuous stream of 
medium-length transactions in the slave, each new snapshot taken will 
prevent progress in the WAL replay, so the WAL replay will advance in 
"baby steps", and can fall behind indefinitely. As soon as there's a 
moment that there's no active snapshot, it can catch up, but if the 
slave is seriously busy, that might never happen.

Nevertheless, I think it's a much nicer approach.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Transaction Snapshots and Hot Standby

From
Heikki Linnakangas
Date:
Csaba Nagy wrote:
> On Thu, 2008-09-11 at 15:23 +0300, Heikki Linnakangas wrote:
>> I'd imagine that even if applying the WAL on the slave is blocked, it's 
>> still streamed from the master to the slave, and in case of failover the 
>> slave will fast-forward before starting up as the new master.
> 
> Which begs the question: what happens with a query which is running on
> the slave in the moment when the slave switches from recovery mode and
> starts up ? Should the running queries be canceled if they are blocking
> applying of WAL, to allow start-up, or let them finish ?

Depends on application, I'd say. I guess we'll need both, like the smart 
and fast shutdown modes.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Transaction Snapshots and Hot Standby

From
Csaba Nagy
Date:
On Thu, 2008-09-11 at 15:42 +0300, Heikki Linnakangas wrote:
> One problem with this, BTW, is that if there's a continuous stream of 
> medium-length transaction in the slave, each new snapshot taken will 
> prevent progress in the WAL replay, so the WAL replay will advance in 
> "baby steps", and can fall behind indefinitely.

Why would it fall behind indefinitely? It should only fall behind to
the "blocking horizon", which should be the start of the longest
currently running transaction... which should be continually advancing
and not too far in the past if there are only medium length transactions
involved. Isn't normal WAL recovery also doing baby-steps, one WAL
record at a time? ;-)

Cheers,
Csaba.




Re: Transaction Snapshots and Hot Standby

From
"Jochem van Dieten"
Date:
On Thu, Sep 11, 2008 at 2:07 AM, Simon Riggs wrote:
> Transaction snapshots is probably the most difficult problem for Hot
> Standby to resolve.

>      * remotely from primary node
>      * locally on the standby node
>
> If we derive a snapshot locally, then we will end up with a situation
> where the xmin of the local snapshot precedes the xmin of the primary
> node. When this occurs it will then be possible for WAL records to
> arrive on the standby that request removal of rows that a transaction
> might wish to see. Preventing that situation can be done by either
> deferring WAL apply or by cancelling queries.

Which operations can request row removal? Isn't that just specific
operations that have their own 'is this safe to remove' calculations
anyway (i.e. vacuum and HOT prune)?

What I am thinking about is a design where the primary node were to
regularly push an OldestXMin into the WAL, the WAL apply process on
the standby nodes pushes it into shared memory and the backends keep
an OldestMasterXMin in shared memory. The standby nodes then regularly
push back the oldest OldestMasterXMin from all backends to the
master. Vacuum and HOT prune could then base their calculations on an
OldestXMin that is not the OldestXMin of the master itself, but of the
master and the standby nodes. That way removal of records that are
still visible on one of the standby nodes is prevented on the master
instead of worked around on the standby nodes.

The obvious downside would be bloat on the master (which could get out
of hand if a slave is a few days behind due to a long report), but I
think in terms of visibility and consistency this would work. Or am I
completely misunderstanding the problem?

Jochem


Re: Transaction Snapshots and Hot Standby

From
Heikki Linnakangas
Date:
Csaba Nagy wrote:
> On Thu, 2008-09-11 at 15:42 +0300, Heikki Linnakangas wrote:
>> One problem with this, BTW, is that if there's a continuous stream of 
>> medium-length transaction in the slave, each new snapshot taken will 
>> prevent progress in the WAL replay, so the WAL replay will advance in 
>> "baby steps", and can fall behind indefinitely.
> 
> Why would it fall behind indefinitely ? It only should fall behind to
> the "blocking horizon", which should be the start of the longest
> currently running transaction... which should be continually advancing
> and not too far in the past if there are only medium length transactions
> involved.

Well, yes, but you can fall behind indefinitely that way. Imagine that 
each transaction on the slave lasts, say 10 minutes, with a new 
transaction starting every 5 minutes. On the master, there's a table 
that's being vacuumed (or HOT-updated) frequently, say after each 
transaction for simplicity. What can happen is that every transaction 
that finishes on the slave will only let the WAL replay advance by one 
XID before blocking on the snapshot of the next slave transaction. The 
WAL replay will advance at a rate of 0.2 TPM, while the master is 
generating 1.0 TPM.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Transaction Snapshots and Hot Standby

From
"Robert Haas"
Date:
> I'd imagine that even if applying the WAL on the slave is blocked, it's
> still streamed from the master to the slave, and in case of failover the
> slave will fast-forward before starting up as the new master. Of course, if
> it has fallen 3 days behind because of a giant reporting query, it can take
> a while to replay all the WAL that has accumulated.

Yes, and meanwhile, any other queries that are running on that box are
seeing three-day old data as well.

In an ideal world, it would be nice if the slave could keep tuples
around even after they are dead and vacuumed on the master.  Pushing
Xmin from the slave to the master creates the possibility of bloating
the master due to activity on the slave, which is not going to improve
reliability.  OTOH, not pushing Xmin leads to several pathological
query behaviors on the slave: cancelled queries, inconsistent reads,
and falling way behind on WAL application.  Either way, it seems to me
a massive and uncomfortable violation of the POLA.

If it were possible for tuples that had been vacuumed on the master to
stick around on the slave for as long as the slave still needed them,
then you'd have the best of both worlds, but I have a feeling
someone's going to say that that's just about impossible to implement.
Against that, all I can say is that neither of the behaviors
described thus far sounds very appealing as a feature, though I'm
certain there are some people who, with sufficient jiggering, could
make effective use of them.

...Robert


Re: Transaction Snapshots and Hot Standby

From
Csaba Nagy
Date:
On Thu, 2008-09-11 at 16:19 +0300, Heikki Linnakangas wrote:
> Well, yes, but you can fall behind indefinitely that way. Imagine that 
> each transaction on the slave lasts, say 10 minutes, with a new 
> transaction starting every 5 minutes. On the master, there's a table 
> that's being vacuumed (or HOT-updated) frequently, say after each 
> transaction for simplicity. What can happen is that every transaction 
> that finishes on the slave will only let the WAL replay advance by one 
> XID before blocking on the snapshot of the next slave transaction. The 
> WAL replay will advance at a rate of 0.2 TPM, while the master is 
> generating 1.0 TPM.

Aha, now I see where I was mistaken... I thought in terms of time and
not transaction IDs. So the time distance between the slave transactions
does not matter at all, only the distance in recovered XIDs matters for
the "blocking horizon"... and if the WAL recovery is blocked, the
"blocking horizon" is stalled as well, so the next transaction on the
slave will in fact require the same "blocking horizon" as all currently
running ones. Now I got it... and that means in fact that if you have
continuously overlapping small transactions, the "blocking horizon"
could even be blocked forever, as there'll always be a query running,
and the new queries will always have the snapshot of the currently
running ones because WAL recovery is stalled... or at least that's what
I understand from the whole thing...

Cheers,
Csaba.




Re: Transaction Snapshots and Hot Standby

From
Dimitri Fontaine
Date:
Le jeudi 11 septembre 2008, Heikki Linnakangas a écrit :
> Well, yes, but you can fall behind indefinitely that way. Imagine that
> each transaction on the slave lasts, say 10 minutes, with a new
> transaction starting every 5 minutes. On the master, there's a table
> that's being vacuumed (or HOT-updated) frequently, say after each
> transaction for simplicity. What can happen is that every transaction
> that finishes on the slave will only let the WAL replay advance by one
> XID before blocking on the snapshot of the next slave transaction. The
> WAL replay will advance at a rate of 0.2 TPM, while the master is
> generating 1.0 TPM.

What would forbid the slave to choose to replay all currently lagging WALs
each time it's given the choice to advance a little?
--
dim

Re: Transaction Snapshots and Hot Standby

From
Csaba Nagy
Date:
On Thu, 2008-09-11 at 15:33 +0200, Dimitri Fontaine wrote:
> What would forbid the slave to choose to replay all currently lagging WALs 
> each time it's given the choice to advance a little?

Well now that I think I understand what Heikki meant, I also think the
problem is that there's no choice at all to advance, because the new
queries will simply have the same snapshot as currently running ones as
long as WAL replay is blocked... further blocking the WAL replay. When
saying this I suppose that the snapshot is in fact based on the last
recovered XID, and not on any slave-local XID. In that case once WAL
recovery is blocked, the snapshot is stalled too, further blocking WAL
recovery, and so on...

Cheers,
Csaba.




Re: Transaction Snapshots and Hot Standby

From
Dimitri Fontaine
Date:
Le jeudi 11 septembre 2008, Csaba Nagy a écrit :
> Well now that I think I understand what Heikki meant, I also think the
> problem is that there's no choice at all to advance, because the new
> queries will simply have the same snapshot as currently running ones as
> long as WAL replay is blocked... further blocking the WAL replay. When
> saying this I suppose that the snapshot is in fact based on the last
> recovered XID, and not on any slave-local XID. In that case once WAL
> recovery is blocked, the snapshot is stalled too, further blocking WAL
> recovery, and so on...

Well, it may be possible to instruct the WAL replay daemon to stop being
polite sometimes: when a given max_lag_delay is reached, it could take locks
and, as soon as it obtains them, replay all remaining WAL.
The max_lag_delay would be a GUC allowing one to set the threshold between
continuing the replay and running queries.
There could be some other nice ideas without inventing yet another GUC, but this
one was easier to think about for me ;)
--
dim

Re: Transaction Snapshots and Hot Standby

From
Heikki Linnakangas
Date:
Csaba Nagy wrote:
> and that means in fact that if you have
> continuously overlapping small transactions, the "blocking horizon"
> could be even blocked forever, as there'll always be a query running,
> and the new queries will always have the snapshot of the currently
> running ones because WAL recovery is stalled... 

Hmm, no I don't think the WAL recovery can become completely stalled. To 
completely stop progressing, we'd need to take a new snapshot that 
includes transaction X, and at the same time be blocked on a vacuum 
record that vacuums a tuple that's visible to transaction X. I don't 
think that can happen, because for such a scenario to arise, in the 
corresponding point in time in the master, there would've been a 
scenario where the vacuum would've removed a tuple that would have been
visible to a newly starting transaction. Which can't happen. I think..

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Transaction Snapshots and Hot Standby

From
Simon Riggs
Date:
Thanks for the detailed thinking. At least one very good new idea here,
some debate on other points.


On Thu, 2008-09-11 at 09:24 +0300, Heikki Linnakangas wrote:

> And still we can't escape the scenario that the slave receives a WAL 
> record that vacuums away a tuple that's still visible according to a 
> snapshot used in the slave. Even with the proposed scheme, this can happen:
> 
> 1. Slave receives a snapshot from master
> 2. A long-running transaction begins on the slave, using that snapshot
> 3. Network connection is lost
> 4. Master hits a timeout, and decides to discard the snapshot it sent to 
> the slave
> 5. A tuple visible to the snapshot is vacuumed
> 6. Network connection is re-established
> 7. Slave receives the vacuum WAL record, even though the long-running 
> transaction still needs the tuple.

Interesting point. (4) is a problem, though not for the reason you
suggest. If we were to stop and start master, that would be sufficient
to discard the snapshot that the standby is using and so cause problems.
So the standby *must* tell the master the recentxmin it is using, as you
suggest later, so good thinking. So part of the handshake between
primary and standby must be "what is your recentxmin?". The primary will
then use the lower/earliest of the two.

> I like the idea of acquiring snapshots locally in the slave much more.

Me too. We just need to know how, if at all.

> As you mentioned, the options there are to defer applying WAL, or cancel 
> queries. I think both options need the same ability to detect when 
> you're about to remove a tuple that's still visible to some snapshot, 
> just the action is different. We should probably provide a GUC to 
> control which you want.

I don't see any practical way of telling whether a tuple removal will
affect a snapshot or not. Each removed row would need to be checked
against each standby snapshot. Even if those were available, it would be
too costly. And even if we can do that, ISTM that neither option is
acceptable: if we cancel queries then touching a frequently updated
table is nearly impossible, or if we delay applying WAL then the standby
could fall behind, impairing its ability for use in HA. (If there was a
way, yes, we should have a parameter for it).

It was also suggested we might take the removed rows and put them in a
side table, but that makes me think of the earlier ideas for HOT and so
I've steered clear of that.

You might detect blocks that have had tuples removed from them *after* a
query started by either
* keeping a hash table of changed blocks - it would be a very big data
structure and hard to keep clean
* adding an additional "last cleaned LSN" onto every data block
* keeping an extra LSN on the bufhdr for each of the shared_buffers,
plus keeping a hash table of blocks that have been cleaned and then
paged out
Once detected, your only option is to cancel the query.
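
For the "last cleaned LSN" variants, the check itself would be trivial;
something along these lines (illustrative types only, not the real buffer
headers):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t XLogRecPtr;     /* simplified */

    /* Illustrative extension of per-buffer bookkeeping. */
    typedef struct BufferCleanupInfo
    {
        XLogRecPtr last_cleaned_lsn;  /* LSN of last WAL record that removed tuples */
    } BufferCleanupInfo;

    /*
     * A query remembers the replay LSN at which it started.  If it later
     * touches a block cleaned after that point, tuples it should still see
     * may already be gone, so the only safe action is to cancel it.
     */
    static bool
    query_must_be_cancelled(const BufferCleanupInfo *buf, XLogRecPtr query_start_lsn)
    {
        return buf->last_cleaned_lsn > query_start_lsn;
    }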

ISTM if we want to try to avoid making recentxmin the same on both primary
and standby then the only viable options are the 3 on the original post.

> However, if we still want to provide the behavior that "as long as the 
> network connection works, the master will not remove tuples still needed 
> in the slave" as an option, a lot simpler implementation is to 
> periodically send the slave's oldest xmin to master. Master can take 
> that into account when calculating its own oldest xmin. That requires a 
> lot less communication than the proposed scheme to send snapshots back 
> and forth. A softer version of that is also possible, where the master 
> obeys the slave's oldest xmin, but only up to a point.

I like this very much. Much simpler implementation and no need for a
delay in granting snapshots. I'll go for this as the default
implementation. Thanks for the idea.

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Transaction Snapshots and Hot Standby

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> So part of the handshake between
> primary and standby must be "what is your recentxmin?". The primary will
> then use the lower/earliest of the two.

Even then, the master might already have vacuumed away tuples that are 
visible to an already running transaction in the slave, before the slave 
connects. Presumably the master doesn't wait for the slave to connect 
before starting to accept new connections.

>> As you mentioned, the options there are to defer applying WAL, or cancel 
>> queries. I think both options need the same ability to detect when 
>> you're about to remove a tuple that's still visible to some snapshot, 
>> just the action is different. We should probably provide a GUC to 
>> control which you want.
> 
> I don't see any practical way of telling whether a tuple removal will
> affect a snapshot or not. Each removed row would need to be checked
> against each standby snapshot. Even if those were available, it would be
> too costly. 

How about using the same method as we use in HeapTupleSatisfiesVacuum? 
Before replaying a vacuum record, look at the xmax of the tuple 
(assuming it committed). If it's < slave's OldestXmin, it can be 
removed. Otherwise not. Like HeapTupleSatisfiesVacuum, it's 
conservative, but doesn't require any extra bookkeeping.
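
In code form, the replay-side check would be something like the following
(simplified sketch; no wraparound handling, and the record fields are made
up for illustration):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;

    /*
     * In the spirit of HeapTupleSatisfiesVacuum: a cleanup record is safe to
     * replay only if the transaction that deleted the tuple is older than
     * every snapshot still in use on the standby.
     */
    static bool
    cleanup_safe_to_replay(TransactionId removed_tuple_xmax,
                           TransactionId standby_oldest_xmin)
    {
        return removed_tuple_xmax < standby_oldest_xmin;
    }

    /* When this returns false, the startup process would wait or cancel queries. */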

And vice versa: if we implement the more precise book-keeping, with all 
snapshots in shared memory or something, we might as well use it in 
HeapTupleSatisfiesVacuum. That has been discussed before, but it's a 
separate project.

> It was also suggested we might take the removed rows and put them in a
> side table, but that makes me think of the earlier ideas for HOT and so
> I've steered clear of that.

Yeah, that's non-trivial. Basically a whole new, different 
implementation of MVCC, but without changing any on-disk formats.

BTW, we haven't talked about how to acquire a snapshot in the slave. 
You'll somehow need to know which transactions have not yet committed, 
but will in the future. In the master, we keep track of in-progress 
transactions in the ProcArray, so I suppose we'll need to do the same in 
the slave. Very similar to prepared transactions, actually. I believe 
the Abort records, which are not actually needed for normal operation, 
become critical here. The slave will need to put an entry to ProcArray 
for any new XLogRecord.xl_xid it sees in the WAL, and remove the entry 
at a Commit and Abort record. And clear them all at a shutdown record.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Transaction Snapshots and Hot Standby

From
Gregory Stark
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

> BTW, we haven't talked about how to acquire a snapshot in the slave. You'll
> somehow need to know which transactions have not yet committed, but will in the
> future. 

I'm not sure why you need to know which ones will commit in the future. ISTM
you need the same information you normally have which is just which ones have
committed as of the point of WAL replay you're at.

However it does occur to me that if you know in advance that a transaction
will abort in the future you could skip applying its WAL records when you see
them. I'm not sure it's worth adding such an optimization though and it might
get weird around vacuum.

> In the master, we keep track of in-progress transaction in the ProcArray, so
> I suppose we'll need to do the same in the slave. Very similar to prepared
> transactions, actually. I believe the Abort records, which are not actually
> needed for normal operation, become critical here. The slave will need to
> put an entry to ProcArray for any new XLogRecord.xl_xid it sees in the WAL,
> and remove the entry at a Commit and Abort record. And clear them all at a
> shutdown record.

That's how I envisioned it when the topic came up, but another solution was
also bandied about -- I'm not sure who originally proposed it. The alternative
method is to have the master periodically insert a data structure describing a
snapshot in the WAL. The slave then keeps a copy of the last snapshot it saw
in the WAL and any new query which starts up uses that.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support!


Re: Transaction Snapshots and Hot Standby

From
Heikki Linnakangas
Date:
Gregory Stark wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> 
>> BTW, we haven't talked about how to acquire a snapshot in the slave. You'll
>> somehow need to know which transactions have not yet committed, but will in the
>> future. 
> 
> I'm not sure why you need to know which ones will commit in the future.

Hmm, I phrased that badly. We need to know which transactions *might* 
commit in the future, IOW, are still in progress. Because we want to 
mark those as in-progress in the snapshots that are taken in the slave. 
Otherwise, when they do commit, they will suddenly become visible in the 
snapshots that didn't know that they were in progress.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Transaction Snapshots and Hot Standby

From
Chris Browne
Date:
heikki.linnakangas@enterprisedb.com (Heikki Linnakangas) writes:
> Simon Riggs wrote:
>> Taking snapshots from primary has a few disadvantages
>>
>>  ...
>>       * snapshots on primary prevent row removal (but this was also an
>>         advantage of this technique!)
>
> That makes it an awful solution for high availability. A backend hung
> in transaction-in-progress state in the slave will prevent row removal
> on the master. Isolating the master from queries performed in the
> slave is exactly the reason why people use hot standby. And running
> long reporting queries in the standby is again a very typical use case.

I agree that this is a demerit to this approach.

Whether or not, on balance, it makes it an 'awful solution for high
availability' is much more in the eye of the beholder, and NOT obvious
on the face of it.
-- 
let name="cbbrowne" and tld="linuxdatabases.info" in name ^ "@" ^ tld;;
http://linuxfinances.info/info/sgml.html
Question: How many surrealists does it take to change a light bulb?

Answer: Two, one to hold the giraffe, and the other to fill the bathtub
        with brightly colored machine tools.


Re: Transaction Snapshots and Hot Standby

From
Simon Riggs
Date:
On Thu, 2008-09-11 at 17:58 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > So part of the handshake between
> > primary and standby must be "what is your recentxmin?". The primary will
> > then use the lower/earliest of the two.
> 
> Even then, the master might already have vacuumed away tuples that are 
> visible to an already running transaction in the slave, before the slave 
> connects. Presumably the master doesn't wait for the slave to connect 
> before starting to accept new connections.

Yep, OK.

> >> As you mentioned, the options there are to defer applying WAL, or cancel 
> >> queries. I think both options need the same ability to detect when 
> >> you're about to remove a tuple that's still visible to some snapshot, 
> >> just the action is different. We should probably provide a GUC to 
> >> control which you want.
> > 
> > I don't see any practical way of telling whether a tuple removal will
> > affect a snapshot or not. Each removed row would need to be checked
> > against each standby snapshot. Even if those were available, it would be
> > too costly. 
> 
> How about using the same method as we use in HeapTupleSatisfiesVacuum? 
> Before replaying a vacuum record, look at the xmax of the tuple 
> (assuming it committed). If it's < slave's OldestXmin, it can be 
> removed. Otherwise not. Like HeapTupleSatisfiesVacuum, it's 
> conservative, but doesn't require any extra bookkeeping.
> 
> And vice versa: if we implement the more precise book-keeping, with all 
> snapshots in shared memory or something, we might as well use it in 
> HeapTupleSatisfiesVacuum. That has been discussed before, but it's a 
> separate project.

Tuple removals earlier than the slave's OldestXmin are easy, that's true.
I'm not sure what you had in mind for "Otherwise not"? 

Maybe you mean "stop applying WAL until slave's OldestXmin is > tuple
removal xid". Not sure, reading other subthreads of this post.

I think it's possible to defer removal actions on specific blocks only,
but that is an optimisation that's best left for a while.

BTW, tuple removals would need a cleanup lock on a block, just as they
do on master server. So WAL apply can be delayed momentarily by
pinholders anyway, whatever we do.

> > It was also suggested we might take the removed rows and put them in a
> > side table, but that makes me think of the earlier ideas for HOT and so
> > I've steered clear of that.
> 
> Yeah, that's non-trivial. Basically a whole new, different 
> implementation of MVCC, but without changing any on-disk formats.
> 
> BTW, we haven't talked about how to acquire a snapshot in the slave. 
> You'll somehow need to know which transactions have not yet committed, 
> but will in the future. In the master, we keep track of in-progress 
> transaction in the ProcArray, so I suppose we'll need to do the same in 
> the slave. Very similar to prepared transactions, actually. I believe 
> the Abort records, which are not actually needed for normal operation, 
> become critical here. The slave will need to put an entry to ProcArray 
> for any new XLogRecord.xl_xid it sees in the WAL, and remove the entry 
> at a Commit and Abort record. And clear them all at a shutdown record.

I wouldn't do it like that.

I was going to maintain a "current snapshot" in shared memory, away from
the PROCARRAY. Each time we see a TransactionId we check whether it's
already been seen, if not, insert it. When a transaction commits or
aborts we remove the stated xid. If we see a shutdown checkpoint we
clear the array completely. When query backends want a snapshot they
just read the array.  It doesn't matter whether queries commit or abort,
since those changes can't be seen anyway by queries until commit.

Reason for doing it this way is PROCARRAY may be full of query backends,
so having dummy backends in there as well sounds confusing.
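
A simplified sketch of maintaining such an array (illustrative only;
locking and realistic sizing are omitted):

    #include <stdint.h>

    typedef uint32_t TransactionId;

    #define MAX_TRACKED_XIDS 1024      /* illustrative bound only */

    /* Lives in shared memory on the standby; assumes suitable locking elsewhere. */
    typedef struct RunningXids
    {
        int           nxids;
        TransactionId xids[MAX_TRACKED_XIDS];
    } RunningXids;

    /* Called for every xid seen in a WAL record. */
    static void
    note_xid_seen(RunningXids *rx, TransactionId xid)
    {
        for (int i = 0; i < rx->nxids; i++)
            if (rx->xids[i] == xid)
                return;                         /* already tracked */
        if (rx->nxids < MAX_TRACKED_XIDS)       /* overflow handling omitted */
            rx->xids[rx->nxids++] = xid;
    }

    /* Called at commit and abort records. */
    static void
    note_xid_finished(RunningXids *rx, TransactionId xid)
    {
        for (int i = 0; i < rx->nxids; i++)
            if (rx->xids[i] == xid)
            {
                rx->xids[i] = rx->xids[--rx->nxids];
                return;
            }
    }

    /* Called at a shutdown checkpoint record. */
    static void
    clear_all_xids(RunningXids *rx)
    {
        rx->nxids = 0;
    }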

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Transaction Snapshots and Hot Standby

From
Simon Riggs
Date:
On Thu, 2008-09-11 at 11:38 +0300, Hannu Krosing wrote:
> On Thu, 2008-09-11 at 09:24 +0300, Heikki Linnakangas wrote:
> 
> > I like the idea of acquiring snapshots locally in the slave much more. 
> > As you mentioned, the options there are to defer applying WAL, or cancel 
> > queries. 
> 
> More exotic ways to defer applying WAL include using some smart
> filesystems to get per-backend data snapshots, using either
> copy-of-write overlay filesystems and filesystem or disk level
> snapshots.

That's certainly possible.

That would mean we maintain a single consistent viewpoint of the whole
database for a period of time. The frozen viewpoint could move forwards
by operator command, or we might keep multiple frozen views.

We can have a LookasideHash table in memory that keeps track of which
blocks have had rows removed from them since the "frozen view" was
taken.

If you request a block, we check to see whether there is a lookaside
copy of it prior to the tuple removals. We then redirect the block
request to a viewpoint relation's block. Each viewpoint gets a separate
relfilenode. We do the switcheroo while holding cleanup lock on block.
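
The read-side redirection would be roughly as follows (the lookaside
lookup is a placeholder; only the control flow is meant to be
illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t BlockNumber;

    typedef struct BlockKey
    {
        uint32_t    relfilenode;
        BlockNumber blkno;
    } BlockKey;

    /* Placeholder for a shared hash keyed by (relfilenode, block number). */
    extern bool lookaside_lookup(BlockKey key,
                                 uint32_t *viewpoint_relfilenode,
                                 BlockNumber *viewpoint_blkno);

    /*
     * If the block was copied to a viewpoint relation before rows were removed
     * from it, frozen-view readers fetch the copy; otherwise the live block.
     */
    static void
    resolve_block_for_frozen_view(BlockKey requested,
                                  uint32_t *relfilenode_out,
                                  BlockNumber *blkno_out)
    {
        if (!lookaside_lookup(requested, relfilenode_out, blkno_out))
        {
            *relfilenode_out = requested.relfilenode;
            *blkno_out = requested.blkno;
        }
    }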

So effectively we would be adding "frozen snapshot" technology to
Postgres. Although we would want to do copy-on-clean rather than
copy-on-write, which could mean significantly better performance.

That might be encumbered by patent in some way.

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Transaction Snapshots and Hot Standby

From
Simon Riggs
Date:
On Thu, 2008-09-11 at 15:42 +0300, Heikki Linnakangas wrote:
> Gregory Stark wrote:
> > b) vacuum on the server which cleans up a tuple the slave has in scope has to
> >    block WAL replay on the slave (which I suppose defeats the purpose of having
> >    a live standby for users concerned more with fail-over latency).
> 
> One problem with this, BTW, is that if there's a continuous stream of 
> medium-length transaction in the slave, each new snapshot taken will 
> prevent progress in the WAL replay, so the WAL replay will advance in 
> "baby steps", and can fall behind indefinitely. As soon as there's a 
> moment that there's no active snapshot, it can catch up, but if the 
> slave is seriously busy, that might never happen.

It should be possible to do mixed mode.

Stall WAL apply for up to X seconds, then cancel queries. Some people
may want X=0 or low, others might find X = very high acceptable (Merlin
et al).

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Transaction Snapshots and Hot Standby

From
Csaba Nagy
Date:
On Fri, 2008-09-12 at 09:38 +0100, Simon Riggs wrote:
> If you request a block, we check to see whether there is a lookaside
> copy of it prior to the tuple removals. We then redirect the block
> request to a viewpoint relation's block. Each viewpoint gets a separate
> relfilenode. We do the switcheroo while holding cleanup lock on block.

Wouldn't it make sense to also have a hint bit on the pages which are
copied away? Then instead of looking up a hash table, you would first
check that bit, and if it's not set you won't look up the hash table at
all.  Then when you clean up the "lookaside copies" you clear those bits
too...

That would probably perform somewhat better for reading than always
looking up a potentially big hash table, and the cost of setting the
hint is probably a lot less than copying away the page in the first
place. Resetting the hint bit might be a bit more expensive.

Cheers,
Csaba.




Re: Transaction Snapshots and Hot Standby

From
Hannu Krosing
Date:
On Fri, 2008-09-12 at 11:21 +0200, Csaba Nagy wrote:
> On Fri, 2008-09-12 at 09:38 +0100, Simon Riggs wrote:
> > If you request a block, we check to see whether there is a lookaside
> > copy of it prior to the tuple removals. We then redirect the block
> > request to a viewpoint relation's block. Each viewpoint gets a separate
> > relfilenode. We do the switcheroo while holding cleanup lock on block.
> 
> Wouldn't it make sense to also have a hint bit on the pages which are
> copied away ? 

There is no need for a hint bit (and one bit would not be enough anyway,
as we may need to keep multiple versions).

An LSN of the last row-removal (VACUUM or HOT-pruning) would serve as a hint
to start digging around in hash tables.

It seems, though, that you may have to look into several disk snapshots
to find the page you need.

> Then instead of looking up a hash table, you first would
> look up that bit, and if not set you won't look up the hash table at
> all.  Then when you clean up the "lookaside copies" you clear those bits
> too...
> 
> That would probably perform somewhat better for reading than always
> looking up a potentially big hash table, and the cost of setting the
> hint is probably a lot less than copying away the page in the first
> place. Resetting the hint bit might be a bit more expensive.
> 
> Cheers,
> Csaba.
> 
> 



Re: Transaction Snapshots and Hot Standby

From
Hannu Krosing
Date:
On Fri, 2008-09-12 at 09:45 +0100, Simon Riggs wrote:
> On Thu, 2008-09-11 at 15:42 +0300, Heikki Linnakangas wrote:
> > Gregory Stark wrote:
> > > b) vacuum on the server which cleans up a tuple the slave has in scope has to
> > >    block WAL replay on the slave (which I suppose defeats the purpose of having
> > >    a live standby for users concerned more with fail-over latency).
> > 
> > One problem with this, BTW, is that if there's a continuous stream of 
> > medium-length transaction in the slave, each new snapshot taken will 
> > prevent progress in the WAL replay, so the WAL replay will advance in 
> > "baby steps", and can fall behind indefinitely. As soon as there's a 
> > moment that there's no active snapshot, it can catch up, but if the 
> > slave is seriously busy, that might never happen.
> 
> It should be possible to do mixed mode.
> 
> Stall WAL apply for up to X seconds, then cancel queries. Some people
> may want X=0 or low, others might find X = very high acceptable (Merlin
> et al).

Or an even milder version:

* Stall WAL apply for up to X seconds, 
* then stall new queries, let old ones run to completion (with optional
fallback to canceling after Y sec), 
* apply WAL. 
* Repeat.

-------------
Hannu




Re: Transaction Snapshots and Hot Standby

From
Simon Riggs
Date:
On Thu, 2008-09-11 at 15:17 +0200, Jochem van Dieten wrote:
> On Thu, Sep 11, 2008 at 2:07 AM, Simon Riggs wrote:
> > Transaction snapshots is probably the most difficult problem for Hot
> > Standby to resolve.
> 
> >      * remotely from primary node
> >      * locally on the standby node
> >
> > If we derive a snapshot locally, then we will end up with a situation
> > where the xmin of the local snapshot precedes the xmin of the primary
> > node. When this occurs it will then be possible for WAL records to
> > arrive on the standby that request removal of rows that a transaction
> > might wish to see. Preventing that situation can be done by either
> > deferring WAL apply or by cancelling queries.
> 
> Which operations can request row removal? Isn't that just specific
> operations that have their own 'is this safe to remove' calculations
> anyway (i.e. vacuum and HOT prune)?
> 
> What I am thinking about is a design where the primary node were to
> regularly push an OldestXMin into the WAL, the WAL apply process on
> the standby nodes pushes it into shared memory and the backends keep
> an OldestMasterXMin in shared memory. The standby nodes then regularly
> pushes back the oldest OldestMasterXMin from all backends to the
> master. Vacuum and HOT prune could then base their calculations on an
> OldestXMin that is not the OldestXMin of the master itself, but of the
> master and the standby nodes. That way removal of records that are
> still visible on one of the standby nodes is prevented on the master
> instead of worked around on the standby nodes.
> 
> The obvious downside would be bloat on the master (which could get out
> of hand if a slave is a few days behind due to a long report), but I
> think in terms of visibility and consistency this would work. Or am I
> completely misunderstanding the problem?

Yes, just sending the xmin from standby to primary is the best way of
having primary and standby work together. I will include this, thanks.

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Transaction Snapshots and Hot Standby

From
Simon Riggs
Date:
On Thu, 2008-09-11 at 17:04 +0300, Heikki Linnakangas wrote:
> Csaba Nagy wrote:
> > and that means in fact that if you have
> > continuously overlapping small transactions, the "blocking horizon"
> > could be even blocked forever, as there'll always be a query running,
> > and the new queries will always have the snapshot of the currently
> > running ones because WAL recovery is stalled... 
> 
> Hmm, no I don't think the WAL recovery can become completely stalled. To 
> completely stop progressing, we'd need to take a new snapshot that 
> includes transaction X, and at the same time be blocked on a vacuum 
> record that vacuums a tuple that's visible to transaction X. I don't 
> think that can happen, because for such a scenario to arise, in the 
> corresponding point in time in the master, there would've been a 
> scenario where the vacuum would've removed a tuple that would have been 
>   visible to a newly starting transaction. Which can't happen. I think..

ISTM Csaba is correct. If WAL apply blocks, the xids don't change and so
the snapshots never change, so WAL apply stays blocked.

The only way out of that is to store up removals for particular blocks,
but that's complicated.

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Transaction Snapshots and Hot Standby

From
Simon Riggs
Date:
On Thu, 2008-09-11 at 01:07 +0100, Simon Riggs wrote:
> Transaction snapshots is probably the most difficult problem for Hot
> Standby to resolve. 

In summary of the thread so far:

When queries on the standby run for significantly longer than the longest
queries on the primary, some problems can occur. Various people have argued
for these responses to the problems:

1. Master uses Standby's OldestXmin
Effects: 
* Long running queries on standby...
   Can delay row removal on primary
   Do not delay apply of WAL records on standby
* Queries on standby give consistent answers in all cases.

2. Master ignores Standby's OldestXmin
Effects:
* Long running queries on standby...
   Have no effect on primary
   Can delay apply of WAL records on standby
* Queries on standby give consistent answers in all cases.

3. Ignore problem
Effects:
* Long running queries on standby...
   Have no effect on primary
   Do not delay apply of WAL records on standby
* Queries on standby give inconsistent answers in some cases, though this
doesn't generate any messages to show inconsistency occurred. Acceptable
for read-only and insert-only tables only.

Hot Standby should provide all 3 responses as options.

(1) would be implemented by sending the Standby's OldestXmin to the primary.
Snapshots would not be sent from the primary; they will be derived locally
from transactions currently being applied.

(2) would be implemented by setting a timer. When the Startup process has
waited for more than "redo_max_delay"/"max_lag_delay" (SIGHUP) we cancel
queries. If the timeout is 0 we cancel queries immediately, without
waiting.

(3) would be implemented using read_consistency = on (default) | off, a
USERSET parameter. When read_consistency = off we ignore the backend's
xmin when deciding whether to wait before applying WAL or not.
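To make the choices concrete, here is a minimal sketch of how the Startup
process might pick an action when a cleanup record conflicts with a running
query; the parameter names follow the proposal above, everything else is
invented for illustration:

#include <stdbool.h>
#include <stdio.h>

typedef enum { APPLY_NOW, WAIT_THEN_CANCEL, CANCEL_NOW } Action;

/* Under option 1 the conflict never arises, because the master held back
 * row removal; this sketch only covers options 2 and 3. */
static Action
resolve_conflict(bool read_consistency, int redo_max_delay_ms,
                 bool conflicting_query_running)
{
    if (!read_consistency)          /* option 3: ignore the problem */
        return APPLY_NOW;
    if (!conflicting_query_running) /* nothing to wait for */
        return APPLY_NOW;
    if (redo_max_delay_ms == 0)     /* option 2 with zero timeout */
        return CANCEL_NOW;
    return WAIT_THEN_CANCEL;        /* option 2: wait up to the timeout */
}

int main(void)
{
    printf("%d\n", resolve_conflict(true, 30000, true));  /* WAIT_THEN_CANCEL */
    printf("%d\n", resolve_conflict(false, 30000, true)); /* APPLY_NOW */
    return 0;
}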

Summary OK for everyone?

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Transaction Snapshots and Hot Standby

From
Simon Riggs
Date:
On Thu, 2008-09-11 at 12:18 +0100, Gregory Stark wrote:

> Is there any middle ground or brilliant ways to get the best of both worlds?

Possibly. Nobody has commented yet on the other ideas on the post
itself.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Transaction Snapshots and Hot Standby

From
Hannu Krosing
Date:
On Fri, 2008-09-12 at 11:19 +0100, Simon Riggs wrote:
> On Thu, 2008-09-11 at 01:07 +0100, Simon Riggs wrote:
> > Transaction snapshots is probably the most difficult problem for Hot
> > Standby to resolve. 
> 
> In summary of thread so far:
> 
> When queries on standby run for significantly longer than longest
> queries on primary, some problems can occur. Various people have argued
> for these responses to the problems:
> 
> 1. Master uses Standby's OldestXmin
...
> 2. Master ignores Standby's OldestXmin
...
> 3. Ignore problem
...

> Summary OK for everyone?

Maybe we should at least mention option 4.

4. Slave keeps copies of removed pages or rows when WAL apply removes
old versions.

Possible ways to do this

 * inside Slave - have some backup store tied to OldestXmin intervals

  * variant 1 - have one global store, accessed through shared mem
  * variant 2 - present removed pages to interested backends and
                let them (decide to) keep them

 * outside Slave - having file system keep old snapshots as long as
needed, still must tie to OldestXmin intervals, but most of work done by
storage layer (SAN or overlay file system).

-------------
Hannu




Re: Transaction Snapshots and Hot Standby

From
Hannu Krosing
Date:
On Fri, 2008-09-12 at 13:53 +0300, Hannu Krosing wrote:

> 4. Slave keeps copies of removed pages or rows when WAL apply removes
> old versions .
> 
> Possible ways to do this
> 
>  * inside Slave - have some backup store tied to OldestXmin intervals
> 
>   * variant 1 - have one global store, accessed through shared mem
>   * variant 2 - present removed pages to interested backends and 
>                 let them (decide to) keep them 
> 
>  * outside Slave - having file system keep old snapshots as long as
> needed, still must tie to OldestXmin intervals, but most of work done by
> storage layer (SAN or overlay file system).

Possible options for "outside Slave" filesystem snapshotting -

ZFS ( http://en.wikipedia.org/wiki/ZFS ) is very likely usable 

Linux LVM + XFS may be usable -
http://arstechnica.com/articles/columns/linux/linux-20041013.ars

Possibly also http://en.wikipedia.org/wiki/Btrfs .

----------------
Hannu




Re: Transaction Snapshots and Hot Standby

From
Gregory Stark
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:

> 3. Ignore problem
> Effects:
> * Long running queries on standby...
>    Have no effect on primary
>    Do not delay apply of WAL records on standby
> * Queries on standby give inconsistent answers in some cases, though
> doesn't generate any messages to show inconsistency occurred. Acceptable
> for read-only and insert only tables only.

This seems like a non-starter. 

Your comment about read-only and insert-only tuples only seems to make sense
if you assume there are other tables being updated simultaneously. Otherwise
of course there would be no WAL records for tuple removals.

In that case the problem is dealing with different usage patterns on different
tables. There might be a way to solve just that use case such as deferring WAL
records for those tables. That doesn't guarantee inter-table data consistency
if there were other queries which read from those tables and updated other
tables based on that data though. Perhaps there's a solution for that too
though.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!


Re: Transaction Snapshots and Hot Standby

From
Richard Huxton
Date:
Gregory Stark wrote:
> In that case the problem is dealing with different usage patterns on different
> tables. There might be a way to solve just that use case such as deferring WAL
> records for those tables. That doesn't guarantee inter-table data consistency
> if there were other queries which read from those tables and updated other
> tables based on that data though. Perhaps there's a solution for that too
> though.

There was a suggestion (Simon - from you?) of a transaction voluntarily
restricting itself to a set of tables. That would obviously reduce the
impact of all the options where the accessed tables weren't being
updated (where update = vacuum + HOT if I've got this straight).

--
Richard Huxton
Archonet Ltd


Re: Transaction Snapshots and Hot Standby

From
Hannu Krosing
Date:
On Fri, 2008-09-12 at 12:31 +0300, Hannu Krosing wrote:
> On Fri, 2008-09-12 at 09:45 +0100, Simon Riggs wrote:
> > On Thu, 2008-09-11 at 15:42 +0300, Heikki Linnakangas wrote:
> > > Gregory Stark wrote:
> > > > b) vacuum on the server which cleans up a tuple the slave has in scope has to
> > > >    block WAL reply on the slave (which I suppose defeats the purpose of having
> > > >    a live standby for users concerned more with fail-over latency).
> > > 
> > > One problem with this, BTW, is that if there's a continuous stream of 
> > > medium-length transaction in the slave, each new snapshot taken will 
> > > prevent progress in the WAL replay, so the WAL replay will advance in 
> > > "baby steps", and can fall behind indefinitely. As soon as there's a 
> > > moment that there's no active snapshot, it can catch up, but if the 
> > > slave is seriously busy, that might never happen.
> > 
> > It should be possible to do mixed mode.
> > 
> > Stall WAL apply for up to X seconds, then cancel queries. Some people
> > may want X=0 or low, others might find X = very high acceptable (Merlin
> > et al).
> 
> Or even milder version.
> 
> * Stall WAL apply for up to X seconds, 
> * then stall new queries, let old ones run to completion (with optional
> fallback to canceling after Y sec), 
> * apply WAL. 
> * Repeat.

Now that I have thought a little more about delegating keeping old
versions to filesystem level (ZFS, XFS+LVM) snapshots I'd like to
propose the following:

0. run queries and apply WAL freely until WAL application would
   remove old rows.
1. stall applying WAL for up to N seconds
2. stall starting new queries for up to M seconds
3. if some backends are still running long queries, then
   3.1. make filesystem level snapshot (FS snapshot),
   3.2. mount the FS snapshot somewhere (maybe as data.at.OldestXmin
        in parallel to $PGDATA) and
   3.3. hand this mounted FS snapshot over to those backends
4. apply WAL
5. GoTo 0.

Of course we need to do the filesystem level snapshots in 3. only if the
long-running queries don't already have one given to them. Queries
running in READ COMMITTED mode that have acquired a new PG snapshot since
they got their FS snapshot may also need a new one.

Also, snapshots need to be reference counted, so we can unmount and
destroy them once all their users have finished.
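A minimal sketch of that reference counting (mount paths and commands are
placeholders only): each FS snapshot carries a use count and the destroy
command runs only when the last backend using it finishes:

#include <stdio.h>
#include <stdlib.h>

/* Illustrative only: a registry entry for a filesystem snapshot handed to
 * backends running long queries. */
typedef struct FsSnapshot
{
    char  mount_point[256];   /* e.g. ".../data.at.OldestXmin" */
    int   refcount;           /* number of backends still using it */
} FsSnapshot;

static void
snapshot_acquire(FsSnapshot *snap)
{
    snap->refcount++;
}

static void
snapshot_release(FsSnapshot *snap)
{
    if (--snap->refcount == 0)
    {
        /* Last user gone: unmount and destroy via an external command. */
        printf("destroy_fs_snapshot_command %s\n", snap->mount_point);
    }
}

int main(void)
{
    FsSnapshot snap = { "/mnt/pgdata.at.1234", 0 };
    snapshot_acquire(&snap);   /* backend 1 starts a long query */
    snapshot_acquire(&snap);   /* backend 2 reuses the same snapshot */
    snapshot_release(&snap);
    snapshot_release(&snap);   /* prints the destroy command here */
    return 0;
}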

I think that enabling long-running queries this way is both low-hanging
fruit (or at least medium-height-hanging ;) ) and also consistent with the
PostgreSQL philosophy of not replicating effort. As an example, we trust
the OS's file system cache and don't try to write our own.

----------------
Hannu











Re: Transaction Snapshots and Hot Standby

From
Csaba Nagy
Date:
On Fri, 2008-09-12 at 12:31 +0100, Richard Huxton wrote:
> There was a suggestion (Simon - from you?) of a transaction voluntarily
> restricting itself to a set of tables.

While thinking about how easy it would be for the DBA to specify the set
of tables a single query is accessing, first I thought that it should be
straightforward enough to look at the query itself for that. Then I
thought: what about views, rules, triggers, user functions etc.? All
those have the potential to access more than you see in the query itself.
And then the really interesting question: what will the slave do with
views, rules and triggers? I guess triggers are out of the question to be
executed; what about rules? Probably they must also be ignored... user
functions will probably get errors if they try to update something...
Views should probably function correctly.

So in any case the functionality available for querying slaves would be
less than for the primary. This is probably good enough for most
purposes...

Cheers,
Csaba.




Re: Transaction Snapshots and Hot Standby

From
Csaba Nagy
Date:
> I think that enabling long-running queries this way is both
> low-hanging
> fruit (or at least medium-height-hanging ;) ) and also consistent to
> PostgreSQL philosophy of not replication effort. As an example we trust
> OS's file system cache and don't try to write our own.

I have again questions (unfortunately I only have questions usually):

* how will the buffers keep 2 different versions of the same page?
* how will you handle the creation of snapshots? I guess there's no
portable and universal API for that (just guessing), or is there some
POSIX thing which is supported or not by the specific FS? So if the FS
does not support it, you skip the snapshot step? And if there's no
universal API, will it be handled by plugins providing a specified API
for snapshotting the FS?
 

I hope my continuous questioning is not too annoying...

Cheers,
Csaba.




Re: Transaction Snapshots and Hot Standby

From
Simon Riggs
Date:
On Fri, 2008-09-12 at 12:25 +0100, Gregory Stark wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> 
> > 3. Ignore problem
> > Effects:
> > * Long running queries on standby...
> >    Have no effect on primary
> >    Do not delay apply of WAL records on standby
> > * Queries on standby give inconsistent answers in some cases, though
> > doesn't generate any messages to show inconsistency occurred. Acceptable
> > for read-only and insert only tables only.
> 
> This seems like a non-starter. 

It works, and is proposed as a non-default option since a number of
people have independently said to me that this would be
acceptable/preferred.

> Your comment about read-only and insert-only tuples only seems to make sense
> if you assume there are other tables being updated simultaneously. Otherwise
> of course there would be no WAL records for tuple removals.

Yeh, you got it.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Transaction Snapshots and Hot Standby

From
Hannu Krosing
Date:
On Fri, 2008-09-12 at 13:54 +0200, Csaba Nagy wrote:
> > I think that enabling long-running queries this way is both
> > low-hanging
> > fruit (or at least medium-height-hanging ;) ) and also consistent to
> > PostgreSQL philosophy of not replication effort. As an example we trust
> > OS's file system cache and don't try to write our own.
> 
> I have again questions (unfortunately I only have questions usually):
> 
> * how will the buffers keep 2 different versions of the same page ?

As the FS snapshot is mounted as a different directory, it will have
its own buffer pages.

To conserve RAM, one could go to the FS snapshot files only when the main
pages have an LSN too new to be trusted.

> * how will you handle the creation of snapshots ? 

Probably an external command, possibly a shell script, similar to the
current "archive_command" for WAL copying.

maybe 'create_fs_snapshot_command' and 'destroy_fs_snapshot_command'

> I guess there's no portable and universal API for that (just guessing), 
> or there is some POSIX thing which is supported or not by the specific FS ? 
> So if the FS is not supporting it, you skip the snapshot step ? 

Yes, if FS snapshots are not supported, we fall back to either
inconsistent reads or killing long-running queries.

> And if there's no universal API, will it be handled by plugins providing 
> a specified API for snapshotting the FS ?

Yes, the simplest one being an external command. As FS snapshots are
supposed to happen not too often, at least not every second, just having
external commands may be enough.
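For illustration only, invoking such a command could work much like
archive_command does today, substituting a placeholder for the snapshot
mount point; the command names are only the ones floated above, and the %d
placeholder is invented for this example:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Sketch: substitute %d (destination mount point) into the configured
 * command and run it through the shell, as archive_command does with %p/%f.
 * Returns the shell's exit status; 0 means the snapshot was created.
 */
static int
run_snapshot_command(const char *command_template, const char *dest)
{
    char cmd[1024];
    size_t out = 0;

    for (const char *p = command_template; *p && out < sizeof(cmd) - 1; p++)
    {
        if (p[0] == '%' && p[1] == 'd')
        {
            size_t len = strlen(dest);
            if (out + len >= sizeof(cmd) - 1)
                return -1;
            memcpy(cmd + out, dest, len);
            out += len;
            p++;                    /* skip the 'd' */
        }
        else
            cmd[out++] = *p;
    }
    cmd[out] = '\0';
    return system(cmd);
}

int main(void)
{
    /* e.g. create_fs_snapshot_command = 'make_lvm_snapshot.sh %d' */
    return run_snapshot_command("echo creating snapshot at %d", "/mnt/pg_snap");
}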

> I hope my continuous questioning is not too annoying...

On the contrary, much appreciated. :)

----------------
Hannu




Re: Transaction Snapshots and Hot Standby

From
Csaba Nagy
Date:
On Fri, 2008-09-12 at 15:08 +0300, Hannu Krosing wrote:
> > * how will the buffers keep 2 different versions of the same page ?
> 
> As the FS snapshot is mounted as a different directory, it will have
> it's own buffer pages.

Lack of knowledge about this shows my ignorance about the implementation
of the page buffers...

> > * how will you handle the creation of snapshots ? 
> 
> probably an external command, possibly shell script. 
> similar to current "archive_command" for wal copying
> 
> maybe 'create_fs_snapshot_command' and 'destroy_fs_snapshot_command'
[snip]
> Yes, the simplest one being external command. As FS snapshots are
> supposed to happen not-too often, at least not every second, just having
> external commands may be enough.

You could restrict the creation of snapshots to some minimum amount of
time between them, and maybe also restrict the maximum number of
concurrent snapshots possible. Then if the time limit (as calculated
from the last open snapshot) is currently not met, any new query could
reuse that last snapshot. The time intervals do not need to be evenly
distributed BTW, it could be a function of the already opened snapshots,
like increase the minimum interval exponentially with the number of
already opened snapshots. That would help to catch more long running
queries to just a few snapshots.
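A sketch of that throttling rule, with purely illustrative numbers: the
required gap before creating a brand-new FS snapshot doubles for each
snapshot already open, and a query arriving inside the gap simply reuses
the newest one:

#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/* Illustrative policy: double the required gap per snapshot already open. */
static double
min_interval_seconds(int open_snapshots, double base_interval)
{
    double interval = base_interval;
    for (int i = 0; i < open_snapshots; i++)
        interval *= 2.0;
    return interval;
}

/* Decide whether a new query gets a fresh FS snapshot or reuses the last one. */
static bool
need_new_snapshot(time_t now, time_t last_snapshot_time,
                  int open_snapshots, double base_interval)
{
    return difftime(now, last_snapshot_time) >=
           min_interval_seconds(open_snapshots, base_interval);
}

int main(void)
{
    time_t last = time(NULL) - 90;          /* last snapshot 90 seconds ago */
    /* with 3 open snapshots the required gap is 30 * 2^3 = 240s, so reuse */
    printf("%s\n", need_new_snapshot(time(NULL), last, 3, 30.0)
                       ? "create new snapshot" : "reuse last snapshot");
    return 0;
}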

> > I hope my continuous questioning is not too annoying...
> 
> On the contrary, much appreciated. :)

Ok, then I'll continue :-) I would like to see this feature succeed, but
there's slim chance I'll ever code well in C...

Cheers,
Csaba.




Re: Transaction Snapshots and Hot Standby

From
Simon Riggs
Date:
On Fri, 2008-09-12 at 15:08 +0300, Hannu Krosing wrote:
> On Fri, 2008-09-12 at 13:54 +0200, Csaba Nagy wrote:
> > > I think that enabling long-running queries this way is both
> > > low-hanging
> > > fruit (or at least medium-height-hanging ;) ) and also consistent to
> > > PostgreSQL philosophy of not replication effort. As an example we trust
> > > OS's file system cache and don't try to write our own.
> > 
> > I have again questions (unfortunately I only have questions usually):
> > 
> > * how will the buffers keep 2 different versions of the same page ?
> 
> As the FS snapshot is mounted as a different directory, it will have
> it's own buffer pages.

RelFileNode has a spcNode which can be redirected to a temporary
filesystem snapshot. So it's relatively easy to imagine redirecting
access to a table from its normal tablespace to the snapshot one.

> To conserve RAM, one could go to FS snapshot files only in case main
> pages have LSN too big to be trusted.

That would mean you'd need to do two I/Os, one to read the newly changed
page to check its LSN and another to get the old COW copy. We might waste
buffer space with that technique also. Since we'd be trying to avoid
caching bigger tables anyway (since 8.3) it seems easier to just go
straight to the COW copy.

So I think it's fairly straightforward to support temporary snapshots in
Postgres, with creation/destruction handled in the way you say.
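As a hypothetical sketch of that redirection (the struct only mirrors the
shape of RelFileNode, none of this is actual backend code), a backend bound
to an FS snapshot would build relation file paths under the snapshot mount
instead of the live data directory:

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* Minimal stand-in for the usual (tablespace, database, relation) triple. */
typedef struct
{
    uint32_t spcNode;
    uint32_t dbNode;
    uint32_t relNode;
} RelFileNodeSketch;

/*
 * Build the path used to open the relation's file. If the backend is bound
 * to an FS snapshot, point at the snapshot mount instead of the live data
 * directory; otherwise use the normal location.
 */
static void
relation_path(char *buf, size_t len, RelFileNodeSketch rnode,
              bool use_snapshot, const char *snapshot_mount)
{
    const char *base = use_snapshot ? snapshot_mount : "$PGDATA";
    snprintf(buf, len, "%s/base/%u/%u", base,
             (unsigned) rnode.dbNode, (unsigned) rnode.relNode);
}

int main(void)
{
    char path[512];
    RelFileNodeSketch rnode = { 1663, 16384, 16397 };
    relation_path(path, sizeof(path), rnode, true, "/mnt/pgdata.at.1234");
    printf("%s\n", path);   /* reads come straight from the COW copy */
    return 0;
}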

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Transaction Snapshots and Hot Standby

From
"Florian G. Pflug"
Date:
Heikki Linnakangas wrote:
> BTW, we haven't talked about how to acquire a snapshot in the slave.
>  You'll somehow need to know which transactions have not yet
> committed, but will in the future. In the master, we keep track of
> in-progress transaction in the ProcArray, so I suppose we'll need to
> do the same in the slave. Very similar to prepared transactions,
> actually. I believe the Abort records, which are not actually needed
> for normal operation, become critical here. The slave will need to
> put an entry to ProcArray for any new XLogRecord.xl_xid it sees in
> the WAL, and remove the entry at a Commit and Abort record. And clear
> them all at a shutdown record.

For reference, here is how I solved the snapshot problem in my
Summer-of-Code project last year, which dealt exactly with executing
read-only queries on PITR slaves (But sadly never came out of alpha
stage due to both my and Simon's lack of time)

The main idea was to invert the meaning of the xid array in the snapshot
struct - instead of storing all the xid's between xmin and xmax that are
to be considered "in-progress", the array contained all the xid's >
xmin that are to be considered "completed".
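A minimal sketch of visibility under such an inverted snapshot (made-up
types, and xid wraparound ignored for brevity): an xid counts as completed
if it is below xmin or listed in the array, and as in-progress otherwise:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

/* Inverted snapshot: xip[] lists xids >= xmin that are already completed. */
typedef struct
{
    TransactionId xmin;
    int           nxip;
    TransactionId xip[64];
} InvertedSnapshot;

static bool
xid_completed(const InvertedSnapshot *snap, TransactionId xid)
{
    if (xid < snap->xmin)
        return true;                     /* everything below xmin is done */
    for (int i = 0; i < snap->nxip; i++)
        if (snap->xip[i] == xid)
            return true;                 /* explicitly marked completed */
    return false;                        /* otherwise treated as in progress */
}

int main(void)
{
    InvertedSnapshot snap = { 1000, 2, { 1003, 1007 } };
    printf("%d %d %d\n",
           xid_completed(&snap, 995),    /* 1: below xmin */
           xid_completed(&snap, 1003),   /* 1: in the completed array */
           xid_completed(&snap, 1005));  /* 0: still in progress */
    return 0;
}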

The current read-only snapshot (where "current" means the
corresponding state on the master at the time the last replayed WAL
record was generated) was maintained in shared memory. Its xmin field
was continually updated with the (newly added) XLogRecord.xl_xmin
field, which contained the xid of the oldest running query on the
master, with a pruning step after each ReadOnlySnapshot.xmin update to
remove all entries < xmin from the xid array. If a commit was seen for
an xid, that xid was added to the ReadOnlySnapshot.xid array.

The advantage of this concept is that it handles snapshotting on the
slave without too much additional work for the master (the only change
is the addition of the xl_xmin field to XLogRecord). It especially
removes the need to track ShmemVariableCache->nextXid.

The downside is that the size of the read-only snapshot is theoretically
unbounded, which poses a bit of a problem if it's supposed to live
inside shared memory...

regards, Florian Pflug


Re: Transaction Snapshots and Hot Standby

From
Simon Riggs
Date:
On Sat, 2008-09-13 at 10:48 +0100, Florian G. Pflug wrote:

> The main idea was to invert the meaning of the xid array in the snapshot
> struct - instead of storing all the xid's between xmin and xmax that are
> to be considering "in-progress", the array contained all the xid's >
> xmin that are to be considered "completed".

> The downside is that the size of the read-only snapshot is theoretically
> unbounded, which poses a bit of a problem if it's supposed to live
> inside shared memory...

Why do it inverted? That clearly has problems.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Transaction Snapshots and Hot Standby

From
Simon Riggs
Date:
On Sat, 2008-09-13 at 10:48 +0100, Florian G. Pflug wrote:

> The current read-only snapshot (which "current" meaning the
> corresponding state on the master at the time the last replayed wal
> record was generated) was maintained in shared memory. It' xmin field
> was continually updated with the (newly added) XLogRecord.xl_xmin
> field, which contained the xid of the oldest running query on the
> master, with a pruning step after each ReadOnlySnapshot.xmin update to
> remove all entries < xmin from the xid array. If a commit was seen for
> an xid, that xid was added to the ReadOnlySnapshot.xid array.
> 
> The advantage of this concept is that it handles snapshotting on the
> slave without too much additional work for the master (The only change
> is the addition of the xl_xmin field to XLogRecord). It especially
> removes that need to track ShmemVariableCache->nextXid.

Snapshots only need to know which transactions are currently "running"
during WAL apply. The standby can't remove any tuples itself, so it
doesn't need to know what the master's OldestXmin is. 

So passing xl_xmin from master to standby seems not necessary to me. The
standby's OldestXmin needs to be passed through to the master, not the
other way around so that WAL records for tuple removal are not
generated.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Transaction Snapshots and Hot Standby

From
Gregory Stark
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:

> So passing xl_xmin from master to standby seems not necessary to me. The
> standby's OldestXmin needs to be passed through to the master, not the
> other way around so that WAL records for tuple removal are not
> generated.

I think most people were pretty leery of doing it that way because it means
activity on the standby database can cause the master to bloat. The consensus
seemed to be headed towards having WAL replay on the standby stall if it meets
a tuple removal record for a tuple that's visible to a query running on it.
Probably with a mechanism to configure a maximum amount of time it can be
stalled before shooting those queries.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!


Re: Transaction Snapshots and Hot Standby

From
Simon Riggs
Date:
On Mon, 2008-09-15 at 13:13 +0100, Gregory Stark wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> 
> > So passing xl_xmin from master to standby seems not necessary to me. The
> > standby's OldestXmin needs to be passed through to the master, not the
> > other way around so that WAL records for tuple removal are not
> > generated.
> 
> I think most people were pretty leery of doing it that way because it means
> activity on the standby database can cause the master to bloat. The consensus
> seemed to be headed towards having WAL replay on the standby stall if it meets
> a tuple removal record for a tuple that's visible to a query running on it.
> Probably with a mechanism to configure a maximum amount of time it can be
> stalled before shooting those queries.

Well, my impression from all inputs is there is no single preferred
route. Any one of the approaches seems to have a possible objection,
depending upon the needs of the user. So I'm going to give options.

In any case it's not just a two-horse race. There are other options,
favoured by some people, that you've not personally commented on in any
of your summaries (that's up to you, of course).

And "Standby causing master to bloat" is not such a big problem. It's no
different to running queries directly on the master. But please don't
take it that I don't see the problem or think it the best solution in
all cases.

Certainly, halting WAL replay for any length of time might mean it can
never catch up again, so that won't be acceptable for many cases.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Transaction Snapshots and Hot Standby

From
"Merlin Moncure"
Date:
On Mon, Sep 15, 2008 at 8:40 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> On Mon, 2008-09-15 at 13:13 +0100, Gregory Stark wrote:
>> Simon Riggs <simon@2ndQuadrant.com> writes:
>>
>> > So passing xl_xmin from master to standby seems not necessary to me. The
>> > standby's OldestXmin needs to be passed through to the master, not the
>> > other way around so that WAL records for tuple removal are not
>> > generated.
>>
>> I think most people were pretty leery of doing it that way because it means
>> activity on the standby database can cause the master to bloat. The consensus
>> seemed to be headed towards having WAL replay on the standby stall if it meets
>> a tuple removal record for a tuple that's visible to a query running on it.
>> Probably with a mechanism to configure a maximum amount of time it can be
>> stalled before shooting those queries.
>
> Well, my impression from all inputs is there is no single preferred
> route. Any one of the approaches seems to have a possible objection,
> depending upon the needs of the user. So I'm going to give options.
>
> In any case it's not just a two horse race. There are other options,
> favoured by some people, that you've not personally commented on in any
> of your summaries (thats up to you, of course).
>
> And "Standby causing master to bloat" is not such a big problem. It's no
> different to running queries directly on the master. But please don't
> take it that I don't see the problem or think it the best solution in
> all cases.

It's not a problem, but a relative disadvantage.  What makes warm
standby really attractive relative to other data transfer solutions is
that there are no side effects on the master outside of setting the
archive command...communication is one way.  Any 'master bloat' style
approach seems to be increasingly fragile if you want to increase the
number of standby servers, if it's even possible to do that.

I'm not saying it should be the only way to do it (and it may not even
be possible)...but it would be very attractive to be able to run a hot
standby much the same as a warm standby is running today...I could, for
example, easily script a second standby but keep it a week behind.

merlin


Re: Transaction Snapshots and Hot Standby

From
Heikki Linnakangas
Date:
Gregory Stark wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> 
>> So passing xl_xmin from master to standby seems not necessary to me. The
>> standby's OldestXmin needs to be passed through to the master, not the
>> other way around so that WAL records for tuple removal are not
>> generated.
> 
> I think most people were pretty leery of doing it that way because it means
> activity on the standby database can cause the master to bloat. The consensus
> seemed to be headed towards having WAL replay on the standby stall if it meets
> a tuple removal record for a tuple that's visible to a query running on it.
> Probably with a mechanism to configure a maximum amount of time it can be
> stalled before shooting those queries.

Yes, but we can quite easily provide an option on top of that to 
advertise the slave xmin back to the master, for those who prefer some 
bloat on master over stalling replay or killing queries in the slave. In 
fact, I think a lot of people will choose some compromise, where the 
slave xmin is advertised back to the master, but the master will obey it 
only up to some limit, after which the slave will need to stall or kill 
queries again.
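A sketch of that compromise, with invented names: the master honours the
slave's xmin only while it lags by less than a configured number of
transactions, and otherwise falls back to a capped horizon, at which point
the slave must stall or cancel:

#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

/*
 * Illustrative only, ignoring xid wraparound: obey the slave's xmin up to
 * max_standby_xid_lag transactions behind the master's own horizon.
 */
static TransactionId
removal_horizon(TransactionId master_xmin, TransactionId slave_xmin,
                uint32_t max_standby_xid_lag)
{
    if (slave_xmin >= master_xmin)
        return master_xmin;                       /* slave is not behind */
    if (master_xmin - slave_xmin <= max_standby_xid_lag)
        return slave_xmin;                        /* honour the slave fully */
    return master_xmin - max_standby_xid_lag;     /* cap the bloat on master */
}

int main(void)
{
    printf("%u\n", (unsigned) removal_horizon(5000, 4900, 500)); /* 4900 */
    printf("%u\n", (unsigned) removal_horizon(5000, 3000, 500)); /* 4500 */
    return 0;
}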

It's not something that needs to be done in the first phase, but should 
be straightforward to add after the basics are working. In any case, 
we'll need the capability in the slave to notice when it's about to 
remove a tuple that's still visible to a snapshot in the slave.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


Re: Transaction Snapshots and Hot Standby

From
Simon Riggs
Date:
On Mon, 2008-09-15 at 16:26 +0300, Heikki Linnakangas wrote:
> In any case, we'll need the capability in the slave to notice when
> it's about to remove a tuple that's still visible to a snapshot in the
> slave.

Agreed.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Transaction Snapshots and Hot Standby

From
Simon Riggs
Date:
On Mon, 2008-09-15 at 09:07 -0400, Merlin Moncure wrote:
> > Well, my impression from all inputs is there is no single preferred
> > route. Any one of the approaches seems to have a possible objection,
> > depending upon the needs of the user. So I'm going to give options.
> >
> > In any case it's not just a two horse race. There are other options,
> > favoured by some people, that you've not personally commented on in
> any
> > of your summaries (thats up to you, of course).
> >
> > And "Standby causing master to bloat" is not such a big problem. It's no
> > different to running queries directly on the master. But please don't
> > take it that I don't see the problem or think it the best solution in
> > all cases.
> 
> It's not a problem, but a relative disadvantage.  

In some circumstances, yes; in others, not. But it's not the only
consideration, and different people attach different weightings in their
decision making.

> What makes warm
> standby really attractive relative to other data transfer solutions is
> that there are no side effects on the master outside of setting the
> archive command...communication is one way.  Any 'master bloat' style
> approach seems to be increasingly fragile if you want to increase the
> number of standby servers, if it's even possible to do that.
> 
> I'm not saying it should be the only way to do it (and it may not even
> be possible)...but it would be very attractive to be able to run a hot
> standby much the same as a warm standby is running today...I could,
> for example easily script a second standby but keep it a week behind
> for example.

I hope to provide options for all users, not just a single approach. If
you choose never to use the option to link standby and master, I respect
that and understand. I've listened to your requirements and believe I'm
including things to allow it to work for you the way you've said.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Transaction Snapshots and Hot Standby

From
Simon Riggs
Date:
On Mon, 2008-09-15 at 16:26 +0300, Heikki Linnakangas wrote:
> In any case, 
> we'll need the capability in the slave to notice when it's about to 
> remove a tuple that's still visible to a snapshot in the slave.

Looks like what I'll do is this:

Alter functions in pruneheap.c so that we bubble up the latest xid that
is being removed as part of block cleaning. We then add that xid into
the WAL record for cleaning.

If the latest xid of the clean is ahead of the oldest xmin of running
queries on the standby then the Startup process needs to take action of
some kind.

Re-examining the tuples during WAL apply seems a bad plan, since we'd have
to touch stuff in the block twice and juggle the locks.
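A rough sketch of the standby-side check (hypothetical structures; the
real record layout and the action taken are exactly what's still open):
the cleanup record carries the xid bubbled up from pruning, and the
Startup process compares it with the oldest xmin of running queries:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

/* Sketch of a cleanup record carrying the xid bubbled up from pruning. */
typedef struct
{
    TransactionId latest_removed_xid;
} CleanupRecordSketch;

/*
 * True if applying this record could remove rows that some standby query
 * can still see, i.e. the Startup process must wait or cancel queries.
 */
static bool
cleanup_conflicts(const CleanupRecordSketch *rec,
                  TransactionId standby_oldest_xmin)
{
    return rec->latest_removed_xid >= standby_oldest_xmin;
}

int main(void)
{
    CleanupRecordSketch rec = { 2100 };
    printf("%s\n", cleanup_conflicts(&rec, 2050)
                       ? "wait or cancel" : "apply immediately");
    return 0;
}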

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Transaction Snapshots and Hot Standby

From
Robert Treat
Date:
On Friday 12 September 2008 07:44:36 Csaba Nagy wrote:
> And then
> the actually interesting question: what will the slave do with views,
> rules, triggers ? I guess triggers are out of the question to be
> executed, what about rules ? Probably must be also ignored... user
> functions will probably get errors if they try to update something...
> Views should probably function correctly.
>

If we don't have rules, we don't get views, so those need to be supported.
Really we should allow anything that would work in the context of a
read-only transaction (i.e. functions that don't change anything should be
fine to call).

-- 
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL


Re: Transaction Snapshots and Hot Standby

From
"Florian G. Pflug"
Date:
Simon Riggs wrote:
> On Sat, 2008-09-13 at 10:48 +0100, Florian G. Pflug wrote:
> 
>> The main idea was to invert the meaning of the xid array in the snapshot
>> struct - instead of storing all the xid's between xmin and xmax that are
>> to be considering "in-progress", the array contained all the xid's >
>> xmin that are to be considered "completed".
> 
>> The downside is that the size of the read-only snapshot is theoretically
>> unbounded, which poses a bit of a problem if it's supposed to live
>> inside shared memory...
> 
> Why do it inverted? That clearly has problems.

Because it solves the problem of "spontaneously" appearing XIDs in the
WAL. At least prior to 8.3 with virtual xids, a transaction might have
allocated its xid long before actually writing anything to disk, and
therefore long before this XID ever shows up in the WAL. And with a
non-inverted snapshot such an XID would be considered to be "completed"
by transactions on the slave... So one either needs to periodically log
a snapshot on the master or log XID allocations, which both seem to cause
considerable additional load on the master. With an inverted snapshot,
it's sufficient to log the current RecentXmin - a value that is readily
available on the master, and therefore the cost amounts to just one
additional 4-byte field per xlog entry.

regards, Florian Pflug


Re: Transaction Snapshots and Hot Standby

From
"Florian G. Pflug"
Date:
Simon Riggs wrote:
> On Sat, 2008-09-13 at 10:48 +0100, Florian G. Pflug wrote:
> 
>> The current read-only snapshot (which "current" meaning the
>> corresponding state on the master at the time the last replayed wal
>> record was generated) was maintained in shared memory. It' xmin field
>> was continually updated with the (newly added) XLogRecord.xl_xmin
>> field, which contained the xid of the oldest running query on the
>> master, with a pruning step after each ReadOnlySnapshot.xmin update to
>> remove all entries < xmin from the xid array. If a commit was seen for
>> an xid, that xid was added to the ReadOnlySnapshot.xid array.
>>
>> The advantage of this concept is that it handles snapshotting on the
>> slave without too much additional work for the master (The only change
>> is the addition of the xl_xmin field to XLogRecord). It especially
>> removes that need to track ShmemVariableCache->nextXid.
> 
> Snapshots only need to know which transactions are currently "running"
> during WAL apply. The standby can't remove any tuples itself, so it
> doesn't need to know what the master's OldestXmin is. 

I used the logged xmin value to track the shared snapshot's xmin, which 
in turn allowed me to prune the xid array, eliminating all xids < that xmin.

regards, Florian Pflug


Re: Transaction Snapshots and Hot Standby

From
Simon Riggs
Date:
On Mon, 2008-09-15 at 19:20 +0100, Florian G. Pflug wrote:
> Simon Riggs wrote:
> > On Sat, 2008-09-13 at 10:48 +0100, Florian G. Pflug wrote:
> > 
> >> The main idea was to invert the meaning of the xid array in the snapshot
> >> struct - instead of storing all the xid's between xmin and xmax that are
> >> to be considering "in-progress", the array contained all the xid's >
> >> xmin that are to be considered "completed".
> > 
> >> The downside is that the size of the read-only snapshot is theoretically
> >> unbounded, which poses a bit of a problem if it's supposed to live
> >> inside shared memory...
> > 
> > Why do it inverted? That clearly has problems.
> 
> Because it solves the problem of "sponteaously" apprearing XIDs in the 
> WAL. At least prior to 8.3 with virtual xids, a transaction might have 
> allocated it's xid long before actually writing anything to disk, and 
> therefore long before this XID ever shows up in the WAL. And with a 
> non-inverted snapshot such an XID would be considered to be "completed" 
> by transactions on the slave... So, one either needs to periodically log 
> a snapshot on the master or log XID allocations which both seem to cause 
> considerable additional load on the master. With an inverted snapshot, 
> it's sufficient to log the current RecentXmin - a values that is readily 
> available on the master, and therefore the cost amounts to just one 
> additional 4-byte field per xlog entry.

I think I understand what you're saying now, though I think it
mostly/only applies before your brilliant idea in 8.3.

If we have a transaction history that looks like this:

ReadA, WriteB, WriteA (both commit in either order)

then pre-8.3 we would have xidA < xidB, whereas at 8.3 and above we see
that xidA is actually higher than xidB. Now, TransactionIds are assigned
in the order of their first page write and *not* in the order of
transaction start as was previously the case, which isn't very obvious.

So when we replay WAL, we know that WAL is only written with locks held,
so that WriteA and WriteB must be independent of each other. So that
means the locks held by ReadA can be ignored and we can assume that the
above history is the same as

WriteB, WriteA

So if we took a snapshot in the middle of WriteB we would be safe to say
that only transaction B was in progress and that transaction A was not
yet started. So the snapshot we derive on the standby is different to
the one we would have derived on the primary, yet the serializable order
is the same. In general, this means that all reads in a transaction
prior to the first write can be reordered later, so that they can be
assumed to occur exactly prior to the first write of the transaction.
(Please shoot me down, if incorrect).

So when we see the first WAL record of a transaction we know that there
are no in progress transactions with a *lower* xid that we have not yet
seen. So we cannot be confused about whether a transaction is
in-progress, or not.

Almost. Now having written all of that I see there is an obvious race
condition between assignment of an xid and actions that result in the
acquisition of WALInsertLock. So even though the above seems mostly
correct, there is still a gap to plug, but a much smaller one.

So when writing the first WAL record for xid=48, it is possible that
xid=47 has just been assigned and is also just about to write a WAL
record. Thus the absence of a WAL record for xid=47 is not evidence that
xid=47 is complete because it was read-only.

We might handle this with the inverted technique you describe, but there
should be an easier way to track dense packing of xid sequence.

We expect xids to be written to WAL in the order assigned, so we might
check whether a newly assigned xid is 1 higher than the last highest
value to have been inserted into WAL. If it is not, then we can write a
short WAL record to inform readers of WAL that the missing xids in
sequence are in progress also. So readers of WAL will "see" xids in the
correct sequence and are thus able to construct valid snapshots direct
from WAL.
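A sketch of that bookkeeping on the master, with entirely invented names:
before writing the first WAL record of a transaction, if its xid is not
exactly one past the last xid seen in WAL, first log a small record
declaring the skipped xids as still in progress:

#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

static TransactionId last_xid_in_wal = 100;   /* highest xid already logged */

/* Placeholder for emitting a WAL record; here it just prints. */
static void
log_unobserved_xids(TransactionId from, TransactionId to)
{
    printf("WAL: xids %u..%u are in progress\n", (unsigned) from, (unsigned) to);
}

/*
 * Called before writing the first WAL record of transaction 'xid'. If the
 * sequence is not dense, tell WAL readers about the xids they have not seen
 * yet, so snapshots built from WAL stay valid. (Wraparound ignored here.)
 */
static void
note_first_wal_record(TransactionId xid)
{
    if (xid > last_xid_in_wal + 1)
        log_unobserved_xids(last_xid_in_wal + 1, xid - 1);
    if (xid > last_xid_in_wal)
        last_xid_in_wal = xid;
}

int main(void)
{
    note_first_wal_record(101);   /* dense: nothing extra logged */
    note_first_wal_record(104);   /* logs: xids 102..103 are in progress */
    return 0;
}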

I think I should measure how often that occurs to see what problem or
overhead this might cause, if any.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Transaction Snapshots and Hot Standby

From
Bruce Momjian
Date:
Simon Riggs wrote:
> 2. Master ignores Standby's OldestXmin
> Effects:
> * Long running queries on standby...
>    Have no effect on primary
>    Can delay apply of WAL records on standby
> * Queries on standby give consistent answers in all cases.

Just for clarification, if you set a max_slave_delay it means it is the
maximum amount of time WAL replay can be delayed on the slave, _and_ it
is the maximum amount of time a query/snapshot can be guaranteed to run
without the possibility of being canceled.  My point is that these two
concepts are linked to the same setting.

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

 + If your life is a hard drive, Christ can be your backup. +


Re: Transaction Snapshots and Hot Standby

From
Bruce Momjian
Date:
Bruce Momjian wrote:
> Simon Riggs wrote:
> > 2. Master ignores Standby's OldestXmin
> > Effects:
> > * Long running queries on standby...
> >    Have no effect on primary
> >    Can delay apply of WAL records on standby
> > * Queries on standby give consistent answers in all cases.
> 
> Just for clarification, if you set a max_slave_delay it means it is the
> maximum amount of time WAL replay can be delayed on the slave, _and_ it
> is the maximum amount of time a query/snapshot can be guaranteed to run
> without the possibility of being canceled.  My point is that these two
> concepts are linked to the same setting.

FYI, max_slave_delay does not relate to the amount of time of data loss
in the case of master failure, assuming the WAL files are stored on the
slave in some way during the delay.

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

 + If your life is a hard drive, Christ can be your backup. +


Re: Transaction Snapshots and Hot Standby

From
Simon Riggs
Date:
On Wed, 2008-09-24 at 21:19 -0400, Bruce Momjian wrote:
> Simon Riggs wrote:
> > 2. Master ignores Standby's OldestXmin
> > Effects:
> > * Long running queries on standby...
> >    Have no effect on primary
> >    Can delay apply of WAL records on standby
> > * Queries on standby give consistent answers in all cases.
> 
> Just for clarification, if you set a max_slave_delay it means it is the
> maximum amount of time WAL replay can be delayed on the slave, _and_ it
> is the maximum amount of time a query/snapshot can be guaranteed to run
> without the possibility of being canceled.  My point is that these two
> concepts are linked to the same setting.

I didn't even see them as separate, but now you mention it they could
be.

Startup process may need to wait up to max_slave_delay before applying
WAL, if it sees that it must cancel a query to continue. max_slave_delay
is set in postgresql.conf and honoured by the startup process.

What is not yet defined is whether max_slave_delay is a setting per
blockage event, or a total time that could accumulate because of a single
long blockage or many smaller ones. The latter makes more sense, but it
would only be sensible to calculate when we have WAL streaming, using
calculations similar to the bgwriter delay calculations. That definition
makes sense for the sysadmin because it is the amount of extra time a
standby could take to start up after a failover.

If we take max_slave_delay to mean the latter then there is no guarantee
for user queries, since we may already have used up all our time waiting
on one query and there may be no wait time left for the current user
query. It would depend heavily on the rows accessed, so queries might
easily run to completion even though they exceed the max_slave_delay.
That makes sense because if you are asking a question like "what is the
current total of widgets available" you understand that answer changes
quickly over time, whereas the "how many widgets were reordered on day
X" doesn't change over time at all.

If you define this as a "per wait event" setting it provides a guarantee
to the user queries, but also puts WAL apply into free-fall since you
can't control the number of blockage points queries might cause.
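For the cumulative interpretation, a sketch of the accounting (names
invented): the Startup process keeps a running total of time spent stalled
and only waits for whatever budget is left, cancelling immediately once it
is used up:

#include <stdio.h>

/* Illustrative accounting for a cumulative max_slave_delay, in milliseconds. */
typedef struct
{
    long budget_ms;     /* configured max_slave_delay */
    long used_ms;       /* total time already spent stalled on conflicts */
} DelayBudget;

/* Returns how long we may stall for this conflict; 0 means cancel at once. */
static long
allowed_wait(DelayBudget *b, long wanted_ms)
{
    long left = b->budget_ms - b->used_ms;
    if (left <= 0)
        return 0;
    return (wanted_ms < left) ? wanted_ms : left;
}

static void
record_wait(DelayBudget *b, long waited_ms)
{
    b->used_ms += waited_ms;
}

int main(void)
{
    DelayBudget b = { 30000, 0 };           /* 30 second total budget */
    long w = allowed_wait(&b, 20000);       /* first conflict: full 20s */
    record_wait(&b, w);
    printf("second conflict may wait %ld ms\n", allowed_wait(&b, 20000)); /* 10000 */
    return 0;
}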

From what Merlin was saying, it would be sensible to have multiple
slaves: one with a max_slave_delay of 30 seconds to be your HA
fast-startup standby and another where we set max_slave_delay to 5 hours
to guarantee execution time for large reporting queries. Anyway, it's
clear that max_slave_delay needs to be settable while running.

From my perspective, all I can say is "all things are possible" and I
appreciate the need for options to satisfy different use cases. I'll
build the basic mechanics and then we can add fine tuning over time. So
I'll get the basic mechanics in place, suggest we observe how that works
and then decide on the control mechanisms because they sound relatively
simple to add.

Another important note is Hannu's suggestion of using snapshot
filesystems with Postgres. We can include that feature very quickly and
it will complement Hot Standby very well.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Transaction Snapshots and Hot Standby

From
Zeugswetter Andreas OSB sIT
Date:
> Simon Riggs wrote:
> > 2. Master ignores Standby's OldestXmin
> > Effects:
> > * Long running queries on standby...
> >    Have no effect on primary
> >    Can delay apply of WAL records on standby
> > * Queries on standby give consistent answers in all cases.
>
> Just for clarification, if you set a max_slave_delay it means it is the
> maximum amount of time WAL replay can be delayed on the slave, _and_ it
> is the maximum amount of time a query/snapshot can be guaranteed to run
> without the possibility of being canceled.  My point is that these two
> concepts are linked to the same setting.

I wonder whether the cancel can be delayed until a tuple/page is actually accessed
that shows a too new xid.

The procedure would be like this:
Instead of a cancel, the backend gets a message with an lsn_horizon.
From there on, whenever the backend reads a page/tuple with an LSN > lsn_horizon, it cancels.
I think that should allow some more queries to complete, especially ones
that target static tables, or static parts of large tables using
appropriate index btree ranges that are also static.
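A sketch of that backend-side test, with made-up types: once a backend has
been handed an lsn_horizon, every page it reads is checked against it and
the backend cancels itself only on the first too-new page:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t Lsn;            /* stand-in for a WAL position */

static Lsn lsn_horizon = 0;      /* 0 means: no conflict signalled yet */

/* Called when the startup process can no longer guarantee our snapshot. */
static void
set_lsn_horizon(Lsn horizon)
{
    lsn_horizon = horizon;
}

/* Called for every page the query reads; true means the query must abort. */
static bool
page_requires_cancel(Lsn page_lsn)
{
    return lsn_horizon != 0 && page_lsn > lsn_horizon;
}

int main(void)
{
    set_lsn_horizon(5000);
    printf("%d\n", page_requires_cancel(4200));  /* 0: old page, keep going */
    printf("%d\n", page_requires_cancel(5100));  /* 1: page changed past horizon */
    return 0;
}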

Andreas


Re: Transaction Snapshots and Hot Standby

From
Simon Riggs
Date:
On Thu, 2008-09-25 at 11:14 +0200, Zeugswetter Andreas OSB sIT wrote:
> > Simon Riggs wrote:
> > > 2. Master ignores Standby's OldestXmin
> > > Effects:
> > > * Long running queries on standby...
> > >    Have no effect on primary
> > >    Can delay apply of WAL records on standby
> > > * Queries on standby give consistent answers in all cases.
> >
> > Just for clarification, if you set a max_slave_delay it means it is the
> > maximum amount of time WAL replay can be delayed on the slave, _and_ it
> > is the maximum amount of time a query/snapshot can be guaranteed to run
> > without the possibility of being canceled.  My point is that these two
> > concepts are linked to the same setting.
> 
> I wonder whether the cancel can be delayed until a tuple/page is actually accessed
> that shows a too new xid.

Yes, it's feasible and is now part of the design.

This is all about what happens *if* we need to remove rows that a query
can still see.

We might also make HOT and VACUUM slightly less aggressive at removing
rows on the master. We can't use later than OldestXmin for row removal,
but we might choose to use something earlier. That would be an
alternative to using the standby's OldestXmin.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Transaction Snapshots and Hot Standby

From
Simon Riggs
Date:
On Wed, 2008-09-24 at 21:22 -0400, Bruce Momjian wrote:
> Bruce Momjian wrote:
> > Simon Riggs wrote:
> > > 2. Master ignores Standby's OldestXmin
> > > Effects:
> > > * Long running queries on standby...
> > >    Have no effect on primary
> > >    Can delay apply of WAL records on standby
> > > * Queries on standby give consistent answers in all cases.
> > 
> > Just for clarification, if you set a max_slave_delay it means it is the
> > maximum amount of time WAL replay can be delayed on the slave, _and_ it
> > is the maximum amount of time a query/snapshot can be guaranteed to run
> > without the possibility of being canceled.  My point is that these two
> > concepts are linked to the same setting.
> 
> FYI, max_slave_delay does not relate to the amount of time of data loss
> in the case of master failure, assuming the WAL files are stored on the
> slave in some way during the delay.

Another way to handle row removal would be to check whether any users
have access to particular tables or not. If we could issue a permission
to prevent access to security definer functions, then we'd be able to
work out with certainty whether a row removal would ever be visible to
certain users. That would allow us to isolate fast changing tables from
slow changing ones.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Transaction Snapshots and Hot Standby

From
Zeugswetter Andreas OSB sIT
Date:
> > I wonder whether the cancel can be delayed until a tuple/page is actually accessed
> > that shows a too new xid.
>
> Yes, its feasible and is now part of the design.
>
> This is all about what happens *if* we need to remove rows that a query
> can still see.

I was describing a procedure for exactly that case.

If a slave backend has a snapshot that we cannot guarantee any more
(because max_slave_delay has been exceeded):

> > Instead of cancel, the backend gets a message with a lsn_horizon.
> > From there on, whenever the backend reads a page/tuple with a LSN > lsn_horizon it cancels.

but not before (at the time max_slave_delay has been reached), as described earlier.

Andreas


Re: Transaction Snapshots and Hot Standby

From
Simon Riggs
Date:
On Thu, 2008-09-25 at 12:34 +0200, Zeugswetter Andreas OSB sIT wrote:
> > > I wonder whether the cancel can be delayed until a tuple/page is actually accessed
> > > that shows a too new xid.
> >
> > Yes, its feasible and is now part of the design.
> >
> > This is all about what happens *if* we need to remove rows that a query
> > can still see.
> 
> I was describing a procedure for exactly that case.

Ok, I see, sorry.

> If a slave backend has a snapshot that we cannot guarantee any more
> (because max_slave_delay has been exceeded):
> 
> > > Instead of cancel, the backend gets a message with a lsn_horizon.
> > > From there on, whenever the backend reads a page/tuple with a LSN > lsn_horizon it cancels.
> 
> but not before (at the time max_slave_delay has been reached), as described earlier.

Like that.

OK, so in full: 

Each WAL record that cleans tuples has a latestxid on it. If latestxid
is later than the xmin of a running query on the standby then we wait.
When we stop waiting we tell all at-risk queries the LSN of the first WAL
record that has potentially removed tuples they can see. If they see a
block with a higher LSN they cancel *themselves*.

This works OK, since SeqScans never read blocks at end of file that
didn't exist when they started, so long queries need not be cancelled
when they access growing tables.

That combines all the suggested approaches into one. It still leaves the
possibility of passing the standby's OldestXmin to the primary, but does
not rely upon it.
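Putting those pieces together, a sketch of the Startup process side (again
with invented names): wait up to the timeout for conflicting queries, then
publish the record's LSN as their horizon and apply the record:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;
typedef uint64_t Lsn;

/* Stand-ins for shared state; purely illustrative. */
static TransactionId standby_oldest_xmin = 2050;
static Lsn           published_horizon   = 0;

static bool
wait_for_queries(long timeout_ms)
{
    /* Pretend the conflicting queries did not finish within the timeout. */
    (void) timeout_ms;
    return false;
}

static void
apply_cleanup_record(TransactionId latest_removed_xid, Lsn record_lsn,
                     long timeout_ms)
{
    if (latest_removed_xid >= standby_oldest_xmin && !wait_for_queries(timeout_ms))
    {
        /* At-risk queries now cancel themselves on any page newer than this. */
        published_horizon = record_lsn;
    }
    printf("applying cleanup record at LSN %llu (horizon %llu)\n",
           (unsigned long long) record_lsn,
           (unsigned long long) published_horizon);
}

int main(void)
{
    apply_cleanup_record(2100, 987654, 30000);
    return 0;
}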

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Transaction Snapshots and Hot Standby

From
Simon Riggs
Date:
On Thu, 2008-09-11 at 17:58 +0300, Heikki Linnakangas wrote:

> BTW, we haven't talked about how to acquire a snapshot in the slave. 
> You'll somehow need to know which transactions have not yet committed,
> but will in the future. In the master, we keep track of in-progress 
> transaction in the ProcArray, so I suppose we'll need to do the same
> in the slave. Very similar to prepared transactions, actually. I
> believe the Abort records, which are not actually needed for normal
> operation, become critical here. The slave will need to put an entry
> to ProcArray for any new XLogRecord.xl_xid it sees in the WAL, and
> remove the entry at a Commit and Abort record. And clear them all at a
> shutdown record.

Although I said differently earlier, it seems cleaner to make recovery
snapshots work by emulating ProcArray entries as you suggest. Fatal
errors were my problem there, but I think that's solvable now.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support