Thread: Hot standby and b-tree killed items

Hot standby and b-tree killed items

From
Heikki Linnakangas
Date:
Whenever a B-tree index scan fetches a heap tuple that turns out to be 
dead, the B-tree item is marked as killed by calling _bt_killitems. When 
the page gets full, all the killed items are removed by calling 
_bt_vacuum_one_page.

That's a problem for hot standby. If any of the killed b-tree items 
point to a tuple that is still visible to a running read-only 
transaction, we have the same situation as with vacuum, and have to 
either wait for the read-only transaction to finish before applying the 
WAL record or kill the transaction.

It looks like there's some cosmetic changes related to that in the 
patch, the signature of _bt_delitems is modified, but there's no actual 
changes that would handle that situation. I didn't see it on the TODO on 
the hot standby wiki either. Am I missing something, or the patch?

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby and b-tree killed items

From
Simon Riggs
Date:
On Fri, 2008-12-19 at 10:49 +0200, Heikki Linnakangas wrote:

> Whenever a B-tree index scan fetches a heap tuple that turns out to be 
> dead, the B-tree item is marked as killed by calling _bt_killitems. When 
> the page gets full, all the killed items are removed by calling 
> _bt_vacuum_one_page.
> 
> That's a problem for hot standby. If any of the killed b-tree items 
> point to a tuple that is still visible to a running read-only 
> transaction, we have the same situation as with vacuum, and have to 
> either wait for the read-only transaction to finish before applying the 
> WAL record or kill the transaction.
> 
> It looks like there's some cosmetic changes related to that in the 
> patch, the signature of _bt_delitems is modified, but there's no actual 
> changes that would handle that situation. I didn't see it on the TODO on 
> the hot standby wiki either. Am I missing something, or the patch?

ResolveRedoVisibilityConflicts() describes the current patch's position
on this point, which on review is wrong, I agree.

It looks like I assumed that _bt_delitems is only called during VACUUM,
which I knew it wasn't. I know I was going to split XLOG_BTREE_VACUUM
into two record types at one point, one for delete, one for vacuum. In
the end I didn't. Anyhow, its wrong.

We have infrastructure in place to make this work correctly, just need
to add latestRemovedXid field to xl_btree_vacuum. So that part is easily
solved.

Thanks for spotting it. More like that please!

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Hot standby and b-tree killed items

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> We have infrastructure in place to make this work correctly, just need
> to add latestRemovedXid field to xl_btree_vacuum. So that part is easily
> solved.

That's tricky because there's no xmin/xmax on index tuples. You could 
conservatively use OldestXmin as latestRemovedXid, but that could stall 
the WAL redo a lot more than necessary. Or you could store 
latestRemovedXid in the page header, but that would need to be 
WAL-logged to ensure that it's valid after crash. Or you could look at 
the heap to fetch the xmin/xmax, but that would be expensive.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby and b-tree killed items

From
Simon Riggs
Date:
On Fri, 2008-12-19 at 12:24 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > We have infrastructure in place to make this work correctly, just need
> > to add latestRemovedXid field to xl_btree_vacuum. So that part is easily
> > solved.
> 
> That's tricky because there's no xmin/xmax on index tuples.

Doh. 

>  You could 
> conservatively use OldestXmin as latestRemovedXid, but that could stall 
> the WAL redo a lot more than necessary. Or you could store 
> latestRemovedXid in the page header, but that would need to be 
> WAL-logged to ensure that it's valid after crash. Or you could look at 
> the heap to fetch the xmin/xmax, but that would be expensive.

Agreed. Probably need to use OldestXmin then.



If I was going to add anything to the btree page header, it would be
latestRemovedLSN, only set during recovery. That way we don't have to
explicitly kill queries, we can do the a wait on OldestXmin then let
them ERROR out when they find a page that has been modified.

I have a suspicion that we may need some modification of that solution
for all data blocks, so we don't kill too many queries.

Hmmm. I wonder if we can track latestRemovedLSN for all of
shared_buffers. That was initially rejected, but if we set the
latestRemovedLSN to be the block's LSN when we read it in, that would be
fairly useful. Either way we use 8 bytes RAM per buffer.



BTW, I noticed the other day that Oracle 11g only allows you to have a
read only slave *or* allows you to continue replaying. You need to
manually switch back and forth between those modes. They can't do
*both*, as Postgres will be able to do. That's because their undo
information is stored off-block in the Undo Tablespace, so is not
available for standby queries. Nice one, Postgres.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Hot standby and b-tree killed items

From
Greg Stark
Date:
I'm confused shouldn't read-only transactions on the slave just be  
hacked to not set any hint bits including lp_delete?

-- 
Greg


On 19 Dec 2008, at 03:49, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com > wrote:

> Whenever a B-tree index scan fetches a heap tuple that turns out to  
> be dead, the B-tree item is marked as killed by calling  
> _bt_killitems. When the page gets full, all the killed items are  
> removed by calling _bt_vacuum_one_page.
>
> That's a problem for hot standby. If any of the killed b-tree items  
> point to a tuple that is still visible to a running read-only  
> transaction, we have the same situation as with vacuum, and have to  
> either wait for the read-only transaction to finish before applying  
> the WAL record or kill the transaction.
>
> It looks like there's some cosmetic changes related to that in the  
> patch, the signature of _bt_delitems is modified, but there's no  
> actual changes that would handle that situation. I didn't see it on  
> the TODO on the hot standby wiki either. Am I missing something, or  
> the patch?
>
> -- 
>  Heikki Linnakangas
>  EnterpriseDB   http://www.enterprisedb.com
>
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


Re: Hot standby and b-tree killed items

From
Simon Riggs
Date:
On Fri, 2008-12-19 at 09:22 -0500, Greg Stark wrote:

> I'm confused shouldn't read-only transactions on the slave just be  
> hacked to not set any hint bits including lp_delete?

They could be, though I see no value in doing so. 

But that is not Heikki's point. He is discussing what happens on the
primary and the effects that must then occur on the standby. He has
rightly pointed out a (pluggable) hole in my logic.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Hot standby and b-tree killed items

From
"Kevin Grittner"
Date:
>>> Simon Riggs <simon@2ndQuadrant.com> wrote: 
> If I was going to add anything to the btree page header, it would be
> latestRemovedLSN, only set during recovery. That way we don't have
to
> explicitly kill queries, we can do the a wait on OldestXmin then let
> them ERROR out when they find a page that has been modified.
> 
> I have a suspicion that we may need some modification of that
solution
> for all data blocks, so we don't kill too many queries.
If the failure is caused by the timing of various database
transactions, and the query is likely to run successfully after a
delay and a retry, please use SQLSTATE of '40001'.  Some software
(ours, for one) will recognize this and retry the query automatically,
so that the user impact is essentially the same as blocking.
-Kevin


Re: Hot standby and b-tree killed items

From
Simon Riggs
Date:
On Fri, 2008-12-19 at 11:54 -0600, Kevin Grittner wrote:
> >>> Simon Riggs <simon@2ndQuadrant.com> wrote: 
>  
> > If I was going to add anything to the btree page header, it would be
> > latestRemovedLSN, only set during recovery. That way we don't have
> to
> > explicitly kill queries, we can do the a wait on OldestXmin then let
> > them ERROR out when they find a page that has been modified.
> > 
> > I have a suspicion that we may need some modification of that
> solution
> > for all data blocks, so we don't kill too many queries.
>  
> If the failure is caused by the timing of various database
> transactions, and the query is likely to run successfully after a
> delay and a retry, please use SQLSTATE of '40001'.  Some software
> (ours, for one) will recognize this and retry the query automatically,
> so that the user impact is essentially the same as blocking.

I understand the need, but we won't be using SQLSTATE = 40001.

That corresponds to ERRCODE_T_R_SERIALIZATION_FAILURE, which that error
would not be.

The error message ought to be "snapshot too old", which could raise a
chuckle, so I called it something else.

The point you raise is a good one and I think we should publish a list
of retryable error messages. I contemplated once proposing a special log
level for a retryable error, but not quite a good idea.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Hot standby and b-tree killed items

From
Simon Riggs
Date:
On Fri, 2008-12-19 at 10:52 +0000, Simon Riggs wrote:

> >  You could 
> > conservatively use OldestXmin as latestRemovedXid, but that could stall 
> > the WAL redo a lot more than necessary. Or you could store 
> > latestRemovedXid in the page header, but that would need to be 
> > WAL-logged to ensure that it's valid after crash. Or you could look at 
> > the heap to fetch the xmin/xmax, but that would be expensive.
> 
> Agreed. Probably need to use OldestXmin then.

Just finished coding this up, plus TODO item to pin every index page.
Will post after some further testing.

Used RecentOldestXmin.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Hot standby and b-tree killed items

From
"Kevin Grittner"
Date:
>>> Simon Riggs <simon@2ndQuadrant.com> wrote: 
> I understand the need, but we won't be using SQLSTATE = 40001.
> 
> That corresponds to ERRCODE_T_R_SERIALIZATION_FAILURE, which that
error
> would not be.
Isn't it a problem with serialization of database transactions?  You
hit it in a different way, but if it is a temporary failure due to the
timing of the transactions, I strongly feel that that is the correct
SQLSTATE to use.  Perhaps more information to provide any useful
context could be in the info or hint areas?
-Kevin


Re: Hot standby and b-tree killed items

From
Gregory Stark
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:

> The error message ought to be "snapshot too old", which could raise a
> chuckle, so I called it something else.
>
> The point you raise is a good one and I think we should publish a list
> of retryable error messages. I contemplated once proposing a special log
> level for a retryable error, but not quite a good idea.

I'm a bit concerned about the idea of killing off queries to allow WAL to
proceed. While I have nothing against that being an option I think we should
be aiming to make it not necessary for correctness and not the default. By
default I think WAL replay should stick to stalling WAL replay and only resort
to killing queries if the user specifically requests it.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's On-Demand Production
Tuning


Re: Hot standby and b-tree killed items

From
Simon Riggs
Date:
On Fri, 2008-12-19 at 18:59 +0000, Gregory Stark wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> 
> > The error message ought to be "snapshot too old", which could raise a
> > chuckle, so I called it something else.
> >
> > The point you raise is a good one and I think we should publish a list
> > of retryable error messages. I contemplated once proposing a special log
> > level for a retryable error, but not quite a good idea.
> 
> I'm a bit concerned about the idea of killing off queries to allow WAL to
> proceed. While I have nothing against that being an option I think we should
> be aiming to make it not necessary for correctness and not the default. By
> default I think WAL replay should stick to stalling WAL replay and only resort
> to killing queries if the user specifically requests it.

Increasing the waiting time increases the failover time and thus
decreases the value of the standby as an HA system. Others value high
availability higher than you and so we had agreed to provide an option
to allow the max waiting time to be set.

max_standby_delay is set in recovery.conf, value 0 (forever) - 2,000,000
secs, settable in milliseconds. So think of it like a deadlock detector
for recovery apply.

Also, there is a set of functions to control the way recovery proceeds,
much as you might control an MP3 player (start, stop, pause). There ares
also functions to pause at specific xids, pause at specific time, pause
at the next cleanup record. That allows you to set the max_standby_delay
lower and then freeze the server for longer to run a long query if
required. It also allows you to do PITR by trial and error rather than
one shot specify-in-advance settings. There is a function to manually
end recovery at a useful place if desired.

I hope your needs and wishes are catered for by that?

(I have a Plan B in case we need it during wider user testing, as
explained up thread.)

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Hot standby and b-tree killed items

From
"Kevin Grittner"
Date:
>>> Simon Riggs <simon@2ndQuadrant.com> wrote: 
> max_standby_delay is set in recovery.conf, value 0 (forever) -
2,000,000
> secs, settable in milliseconds. So think of it like a deadlock
detector
> for recovery apply.
Aha!  A deadlock is a type of serialization failure.  (In fact, on
databases with lock-based concurrency control rather than MVCC, it can
be the ONLY type of serialization failure.)
-Kevin


Re: Hot standby and b-tree killed items

From
Simon Riggs
Date:
On Fri, 2008-12-19 at 13:47 -0600, Kevin Grittner wrote:
> >>> Simon Riggs <simon@2ndQuadrant.com> wrote: 
>  
> > max_standby_delay is set in recovery.conf, value 0 (forever) -
> 2,000,000
> > secs, settable in milliseconds. So think of it like a deadlock
> detector
> > for recovery apply.
>  
> Aha!  A deadlock is a type of serialization failure.  (In fact, on
> databases with lock-based concurrency control rather than MVCC, it can
> be the ONLY type of serialization failure.)

The SQL Standard specifically names this error as thrown when "it
detects the inability to guarantee the serializability of two or more
concurrent SQL-transactions". Now that really should only apply when
running with SERIALIZABLE transactions, but I grant you the standard
doesn't explicitly say that.

You give me the strange sense that you want this because of some quirk
in your software, rather than an overwhelming desire to see these two
situations described the same.

I guess making it that SQLSTATE would make it simpler to understand why
the error occurs and also how to handle it (i.e. resubmit). So there
probably is a wide argument for making developers jobs a little easier
by doing it. i.e. usability will be improved if we do that.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Hot standby and b-tree killed items

From
Gregory Stark
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:

> Increasing the waiting time increases the failover time and thus
> decreases the value of the standby as an HA system. Others value high
> availability higher than you and so we had agreed to provide an option
> to allow the max waiting time to be set.

Sure, it's a nice option to have. But I think the default should be to pause
WAL replay.

The question I had was whether your solution for btree pointers marked dead
and later dropped from the index works when the user hasn't configured a
timeout and doesn't want standby queries killed.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support!


Re: Hot standby and b-tree killed items

From
Gregory Stark
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

>>>> Simon Riggs <simon@2ndQuadrant.com> wrote: 
>  
>> max_standby_delay is set in recovery.conf, value 0 (forever) -
> 2,000,000
>> secs, settable in milliseconds. So think of it like a deadlock
> detector
>> for recovery apply.
>  
> Aha!  A deadlock is a type of serialization failure.  (In fact, on
> databases with lock-based concurrency control rather than MVCC, it can
> be the ONLY type of serialization failure.)

I think the fundamental difference is that a deadlock or serialization failure
can be predicted as a potential problem when writing the code. This is
something that can happen for any query any time, even plain old read-only
select queries.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!


Re: Hot standby and b-tree killed items

From
"Kevin Grittner"
Date:
>>> Simon Riggs <simon@2ndQuadrant.com> wrote: 
> The SQL Standard specifically names this error as thrown when "it
> detects the inability to guarantee the serializability of two or
more
> concurrent SQL-transactions". Now that really should only apply when
> running with SERIALIZABLE transactions,
I disagree.  Database integrity could not be guaranteed without
detection of conflicting modification in READ COMMITTED on up, and
this is the normal means of indicating these problems.
> but I grant you the standard doesn't explicitly say that.
I think that's intentional.
> You give me the strange sense that you want this because of some
quirk
> in your software, rather than an overwhelming desire to see these
two
> situations described the same.
Well, we are very unlikely to ever use this feature, so it's not
really something I care about for us; it just struck me that there may
be others that care about categorizing errors accurately according the
the SQL standard, and that what you were describing sounded like a new
type of serialization failure in the PostgreSQL environment, and
should be classified that way.
The primary quirkiness of our software is that it needs to be able to
run with a number of different database products, and we do want to
take advantage of whatever information is available in a portable
format.  This is not the only standard SQLSTATE we look for and handle
appropriately for the documented meaning, but it is an important one,
as it has simplified application programming and reduced the confusing
error messages which reach our end users.
> I guess making it that SQLSTATE would make it simpler to understand
why
> the error occurs and also how to handle it (i.e. resubmit).
Precisely.
-Kevin


Re: Hot standby and b-tree killed items

From
"Kevin Grittner"
Date:
>>> Gregory Stark <stark@enterprisedb.com> wrote: 
> I think the fundamental difference is that a deadlock or
serialization 
> failure
> can be predicted as a potential problem when writing the code. This
is
> something that can happen for any query any time, even plain old
read-only
> select queries.
I've heard that on Oracle it is (or at least was) possible to get a
serialization failure on a single SELECT statement which was the only
user-requested activity on the system, because it could conflict with
automatic maintenance operations.
In Sybase and Microsoft databases it is definitely possible for a
plain old read-only SELECT statement to be a deadlock victim (reported
as a serialization failure) if some of the data it is referencing is
being updated concurrently.  In these (and many other) products, a
lock must be acquired before a row can be read.  Imagine, the SELECT
locks a row against updates, another transaction locks some other row
against any access, then the UPDATE tries to change the row locked by
the SELECT while the SELECT tries to read the row locked by the
UPDATE.
PostgreSQL is much less prone to serialization failures, but it is
certainly understandable if hot standby replication introduces new
cases of it.
-Kevin


Re: Hot standby and b-tree killed items

From
Gregory Stark
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

> PostgreSQL is much less prone to serialization failures, but it is
> certainly understandable if hot standby replication introduces new
> cases of it.

In this case it will be possible to get this error even if you're just running
a single SELECT query -- and that's the *only* query in the database at all.

A vacuum being replayed -- even in a different database -- could trigger the
error. Or with the btree split issue, a data load -- again even in a different
database -- would be quite likely cause your SELECT to be killed.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about
EnterpriseDB'sPostgreSQL training!
 


Re: Hot standby and b-tree killed items

From
"Kevin Grittner"
Date:
>>> Gregory Stark <stark@enterprisedb.com> wrote: 
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> 
>> PostgreSQL is much less prone to serialization failures, but it is
>> certainly understandable if hot standby replication introduces new
>> cases of it.
> 
> In this case it will be possible to get this error even if you're
just 
> running
> a single SELECT query -- and that's the *only* query in the database
at all.
> 
> A vacuum being replayed -- even in a different database -- could
trigger the
> error. Or with the btree split issue, a data load -- again even in a
different
> database -- would be quite likely cause your SELECT to be killed.
OK.  Does that make serialization failure a bad description of the
problem?
If these steps are serialized (run one after the other), is there a
problem?  It just seems that the hot standby near-synchronous
replication creates a situation where tasks on multiple, linked
databases might need to be serialized.
It does seem like it will be important to provide as much information
to the user about what's causing the problem, and hints about what to
do.  PostgreSQL has nice features for that, though.
Since I have no vested interest here, I'm not inclined to belabor the
point.  I was really just trying to make sure the feature was as
useful as possible to others, some of whom might be looking for
standard SQLSTATE values to help the software take the right course.
If others feel the 40001 code would confuse more than enlighten, I'll
respect that.
-Kevin


Re: Hot standby and b-tree killed items

From
Robert Treat
Date:
On Friday 19 December 2008 05:52:42 Simon Riggs wrote:
> BTW, I noticed the other day that Oracle 11g only allows you to have a
> read only slave *or* allows you to continue replaying. You need to
> manually switch back and forth between those modes. They can't do
> *both*, as Postgres will be able to do. That's because their undo
> information is stored off-block in the Undo Tablespace, so is not
> available for standby queries. Nice one, Postgres.
>

I think this is true for physical replay, but Oracle also offers the option to 
do logical replay (where transaction logs are converted into sql and run 
against the standby; i believe this is similar to what continuant is trying 
to do with thier latest offering). In that scenario you can do read and 
replay at the same time, though I think there are some conflicts possible; 
fewer than what postgres will have, since I think most of thier DDL can be 
done online. (This might require some extra modules / high end version of 
Oracle, please consult your local Oracle wizard for more details)

-- 
Robert Treat
Conjecture: http://www.xzilla.net
Consulting: http://www.omniti.com


Re: Hot standby and b-tree killed items

From
Simon Riggs
Date:
On Fri, 2008-12-19 at 20:54 +0000, Gregory Stark wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> 
> > PostgreSQL is much less prone to serialization failures, but it is
> > certainly understandable if hot standby replication introduces new
> > cases of it.
> 
> In this case it will be possible to get this error even if you're just running
> a single SELECT query -- and that's the *only* query in the database at all.
> 
> A vacuum being replayed -- even in a different database -- could trigger the
> error. Or with the btree split issue, a data load -- again even in a different
> database -- would be quite likely cause your SELECT to be killed.

Quite likely? "You're all doomed I say!", his eyes rolling wildly. :-)

The standby is an extension of the primary and is quite literally
running the same transactions. This "only query" idea isn't the right
way to think about it. It's fairly easily possible to predict it will
happen and it will happen only in same database as transactions on the
primary. And as we just said, you can control whether and/or after how
long this will happen in some detail.

Industry context: In the worst case this is as bad as Oracle 11g. In
many/most cases it is much better.

Perhaps we should listen to the people that have said they don't want
queries cancelled, even if the alternative is inconsistent answers. That
is easily possible yet is not currently an option. Plus we have the
option I referred to up thread, which is to defer query cancel until the
query reads a modified data block. I'm OK with implementing either of
those, as non-default options. Do we need those options or are we ok?

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Hot standby and b-tree killed items

From
Simon Riggs
Date:
On Fri, 2008-12-19 at 19:29 -0500, Robert Treat wrote:
> On Friday 19 December 2008 05:52:42 Simon Riggs wrote:
> > BTW, I noticed the other day that Oracle 11g only allows you to have a
> > read only slave *or* allows you to continue replaying. You need to
> > manually switch back and forth between those modes. They can't do
> > *both*, as Postgres will be able to do. That's because their undo
> > information is stored off-block in the Undo Tablespace, so is not
> > available for standby queries. Nice one, Postgres.
> >
> 
> I think this is true for physical replay, but Oracle also offers the option to 
> do logical replay (where transaction logs are converted into sql and run 
> against the standby; i believe this is similar to what continuant is trying 
> to do with thier latest offering). In that scenario you can do read and 
> replay at the same time, though I think there are some conflicts possible; 
> fewer than what postgres will have, since I think most of thier DDL can be 
> done online. 

That is also an option I have argued that we need, BTW.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Hot standby and b-tree killed items

From
Heikki Linnakangas
Date:
Gregory Stark wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> 
>> Increasing the waiting time increases the failover time and thus
>> decreases the value of the standby as an HA system. Others value high
>> availability higher than you and so we had agreed to provide an option
>> to allow the max waiting time to be set.
> 
> Sure, it's a nice option to have. But I think the default should be to pause
> WAL replay.

I think I agree that pausing should be the default. If for no other 
reason, because I can't think of a good default for max_standby_delay.

It would be nice to have a setting to specify the max. amount of 
unapplied WAL before killing queries. When the primary isn't doing much, 
you might want wait longer before killing queries, and if you're falling 
behind a lot, you might want to kill queries more aggressively to catch 
up. I guess that doesn't quite fit the current architecture; you'd need 
to peek ahead to see how much unapplied WAL there is.

> The question I had was whether your solution for btree pointers marked dead
> and later dropped from the index works when the user hasn't configured a
> timeout and doesn't want standby queries killed.

Yes, it's not any different from vacuum WAL records.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby and b-tree killed items

From
Heikki Linnakangas
Date:
Heikki Linnakangas wrote:
> Gregory Stark wrote:
>> The question I had was whether your solution for btree pointers marked 
>> dead
>> and later dropped from the index works when the user hasn't configured a
>> timeout and doesn't want standby queries killed.
> 
> Yes, it's not any different from vacuum WAL records.

No wait, there is a nasty corner-case. When an index tuple is marked as 
killed, no WAL record is written. Since there's now WAL record, it won't 
be killed in the slave yet. But if we take a full-page image of that 
page later for some other operation, the LP_DEAD flag is included in the 
full-page image. If the flag sneaks into the slave without an explicit 
WAL record like that, there's no latestRemovedXid for the slave to wait on.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby and b-tree killed items

From
Simon Riggs
Date:
On Sat, 2008-12-20 at 09:21 +0200, Heikki Linnakangas wrote:
> Gregory Stark wrote:
> > Simon Riggs <simon@2ndQuadrant.com> writes:
> > 
> >> Increasing the waiting time increases the failover time and thus
> >> decreases the value of the standby as an HA system. Others value high
> >> availability higher than you and so we had agreed to provide an option
> >> to allow the max waiting time to be set.
> > 
> > Sure, it's a nice option to have. But I think the default should be to pause
> > WAL replay.
> 
> I think I agree that pausing should be the default. If for no other 
> reason, because I can't think of a good default for max_standby_delay.

I would rather err on the side of caution. If we do as you suggest,
somebody will lose their database and start shouting "stupid default".
So I would suggest we set it to say 5 seconds to start with and let
people that read the manual set it higher, or at least read the manual
after they receive their first query cancellation.

> It would be nice to have a setting to specify the max. amount of 
> unapplied WAL before killing queries. 

Agreed.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Hot standby and b-tree killed items

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Sat, 2008-12-20 at 09:21 +0200, Heikki Linnakangas wrote:
>> Gregory Stark wrote:
>>> Simon Riggs <simon@2ndQuadrant.com> writes:
>>>
>>>> Increasing the waiting time increases the failover time and thus
>>>> decreases the value of the standby as an HA system. Others value high
>>>> availability higher than you and so we had agreed to provide an option
>>>> to allow the max waiting time to be set.
>>> Sure, it's a nice option to have. But I think the default should be to pause
>>> WAL replay.
>> I think I agree that pausing should be the default. If for no other 
>> reason, because I can't think of a good default for max_standby_delay.
> 
> I would rather err on the side of caution. If we do as you suggest,
> somebody will lose their database and start shouting "stupid default".

Even if we stop applying the WAL, it should still be archived safely, 
right? So no data should be lost, although the standby can fall very 
much behind, and it can take a while to catch up.

> So I would suggest we set it to say 5 seconds to start with and let
> people that read the manual set it higher, or at least read the manual
> after they receive their first query cancellation.

I don't feel strongly either way...

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby and b-tree killed items

From
Heikki Linnakangas
Date:
Gregory Stark wrote:
> A vacuum being replayed -- even in a different database -- could trigger the
> error. Or with the btree split issue, a data load -- again even in a different
> database -- would be quite likely cause your SELECT to be killed.

Hmm, I wonder if we should/could track the "latestRemovedXid" separately 
for each database. There's no reason why we need to kill a read-only 
query in database X when a table in database Y is vacuumed.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby and b-tree killed items

From
Alvaro Herrera
Date:
Heikki Linnakangas wrote:
> Gregory Stark wrote:
>> A vacuum being replayed -- even in a different database -- could trigger the
>> error. Or with the btree split issue, a data load -- again even in a different
>> database -- would be quite likely cause your SELECT to be killed.
>
> Hmm, I wonder if we should/could track the "latestRemovedXid" separately  
> for each database. There's no reason why we need to kill a read-only  
> query in database X when a table in database Y is vacuumed.

What about shared catalogs?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Hot standby and b-tree killed items

From
Heikki Linnakangas
Date:
Alvaro Herrera wrote:
> Heikki Linnakangas wrote:
>> Gregory Stark wrote:
>>> A vacuum being replayed -- even in a different database -- could trigger the
>>> error. Or with the btree split issue, a data load -- again even in a different
>>> database -- would be quite likely cause your SELECT to be killed.
>> Hmm, I wonder if we should/could track the "latestRemovedXid" separately  
>> for each database. There's no reason why we need to kill a read-only  
>> query in database X when a table in database Y is vacuumed.
> 
> What about shared catalogs?

True, vacuums on shared catalogs would affect read-only queries on all 
databases.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby and b-tree killed items

From
Alvaro Herrera
Date:
Heikki Linnakangas wrote:
> Alvaro Herrera wrote:
>> Heikki Linnakangas wrote:
>>> Gregory Stark wrote:
>>>> A vacuum being replayed -- even in a different database -- could trigger the
>>>> error. Or with the btree split issue, a data load -- again even in a different
>>>> database -- would be quite likely cause your SELECT to be killed.
>>> Hmm, I wonder if we should/could track the "latestRemovedXid" 
>>> separately  for each database. There's no reason why we need to kill 
>>> a read-only  query in database X when a table in database Y is 
>>> vacuumed.
>>
>> What about shared catalogs?
>
> True, vacuums on shared catalogs would affect read-only queries on all  
> databases.

Maybe it's possible to track latestRemovedXid for each database, and
additionally another counter that tracks vacuums on shared catalogs.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Hot standby and b-tree killed items

From
Simon Riggs
Date:
On Sat, 2008-12-20 at 22:07 +0200, Heikki Linnakangas wrote:
> Gregory Stark wrote:
> > A vacuum being replayed -- even in a different database -- could trigger the
> > error. Or with the btree split issue, a data load -- again even in a different
> > database -- would be quite likely cause your SELECT to be killed.
> 
> Hmm, I wonder if we should/could track the "latestRemovedXid" separately 
> for each database. There's no reason why we need to kill a read-only 
> query in database X when a table in database Y is vacuumed.

Already implemented in code. see ResolveRedoVisibilityConflicts()

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Hot standby and b-tree killed items

From
Simon Riggs
Date:
On Sat, 2008-12-20 at 20:09 -0300, Alvaro Herrera wrote:
> Heikki Linnakangas wrote:
> > Gregory Stark wrote:
> >> A vacuum being replayed -- even in a different database -- could trigger the
> >> error. Or with the btree split issue, a data load -- again even in a different
> >> database -- would be quite likely cause your SELECT to be killed.
> >
> > Hmm, I wonder if we should/could track the "latestRemovedXid" separately  
> > for each database. There's no reason why we need to kill a read-only  
> > query in database X when a table in database Y is vacuumed.
> 
> What about shared catalogs?

Hot Standby already covers this special case. Patch uses
GetCurrentVirtualXIDs(), which treats a dbOid of 0 to match against all
databases.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Hot standby and b-tree killed items

From
Simon Riggs
Date:
On Fri, 2008-12-19 at 14:23 -0600, Kevin Grittner wrote:
> > I guess making it that SQLSTATE would make it simpler to understand
> why
> > the error occurs and also how to handle it (i.e. resubmit).
>  
> Precisely.

Just confirming I will implement the SQLSTATE as requested.

I recognize my own and Greg's arguments that the match is not perfect,
but the Standard isn't clear on this and Kevin's interpretation is the
more useful behaviour for developers.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Hot standby and b-tree killed items

From
Robert Treat
Date:
On Saturday 20 December 2008 04:10:21 Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Sat, 2008-12-20 at 09:21 +0200, Heikki Linnakangas wrote:
> >> Gregory Stark wrote:
> >>> Simon Riggs <simon@2ndQuadrant.com> writes:
> >>>> Increasing the waiting time increases the failover time and thus
> >>>> decreases the value of the standby as an HA system. Others value high
> >>>> availability higher than you and so we had agreed to provide an option
> >>>> to allow the max waiting time to be set.
> >>>
> >>> Sure, it's a nice option to have. But I think the default should be to
> >>> pause WAL replay.
> >>
> >> I think I agree that pausing should be the default. If for no other
> >> reason, because I can't think of a good default for max_standby_delay.
> >
> > I would rather err on the side of caution. If we do as you suggest,
> > somebody will lose their database and start shouting "stupid default".
>
> Even if we stop applying the WAL, it should still be archived safely,
> right? So no data should be lost, although the standby can fall very
> much behind, and it can take a while to catch up.
>

I was thinking the condition Simon was concerned about was that on a very busy 
slave with wal delay, you could theoretically fill up the disks and destroy 
the slave. With query cancel, you might be annoyed to see the queries 
canceled, but theres no way that you would destroy the slave. (That might not 
have been what he meant though)

-- 
Robert Treat
Conjecture: http://www.xzilla.net
Consulting: http://www.omniti.com


Re: Hot standby and b-tree killed items

From
Robert Treat
Date:
On Friday 19 December 2008 19:36:42 Simon Riggs wrote:
> Perhaps we should listen to the people that have said they don't want
> queries cancelled, even if the alternative is inconsistent answers. That
> is easily possible yet is not currently an option. Plus we have the
> option I referred to up thread, which is to defer query cancel until the
> query reads a modified data block. I'm OK with implementing either of
> those, as non-default options. Do we need those options or are we ok?
>

Haven't seen any feed back on this, but I think the two options of cancel 
query for replay, and pause replay for queries, are probably enough for a 
first go around (especially if you can get the query canceling to work only 
when changes are made to the specific database in question)

-- 
Robert Treat
Conjecture: http://www.xzilla.net
Consulting: http://www.omniti.com


Re: Hot standby and b-tree killed items

From
Simon Riggs
Date:
On Tue, 2008-12-23 at 23:59 -0500, Robert Treat wrote:
> On Friday 19 December 2008 19:36:42 Simon Riggs wrote:
> > Perhaps we should listen to the people that have said they don't want
> > queries cancelled, even if the alternative is inconsistent answers. That
> > is easily possible yet is not currently an option. Plus we have the
> > option I referred to up thread, which is to defer query cancel until the
> > query reads a modified data block. I'm OK with implementing either of
> > those, as non-default options. Do we need those options or are we ok?
> >
> 
> Haven't seen any feed back on this, but I think the two options of cancel 
> query for replay, and pause replay for queries, are probably enough for a 
> first go around (especially if you can get the query canceling to work only 
> when changes are made to the specific database in question)

Thanks for picking up on this. This question is the #1 usability issue
for Hot Standby, since at least May 2008. There are many potential
additions and we need to track this carefully over the next few months
to see if we have it just right. I'll take viewpoints at any time on
that; this door is never closed, though tempus fugit.

Greg and Heikki have highlighted in this thread some aspects of btree
garbage collection that will increase the chance of queries being
cancelled in various circumstances. If this is important enough to
trigger additional actions then we need to highlight that now so we have
time to take those corrective actions. 

I've listened to many different viewpoints on and off list. Everybody
takes a slightly different angle on it and I'm in favour of giving
everybody what they want with the right set of options.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Hot standby and b-tree killed items

From
"Pavan Deolasee"
Date:
On Wed, Dec 24, 2008 at 4:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
>
> Greg and Heikki have highlighted in this thread some aspects of btree
> garbage collection that will increase the chance of queries being
> cancelled in various circumstances

Even HOT-prune may lead to frequent query cancellations and unlike
VACUUM there is no way user can control the frequency of prune
operations.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com


Re: Hot standby and b-tree killed items

From
Simon Riggs
Date:
On Wed, 2008-12-24 at 16:48 +0530, Pavan Deolasee wrote:
> On Wed, Dec 24, 2008 at 4:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> >
> >
> > Greg and Heikki have highlighted in this thread some aspects of btree
> > garbage collection that will increase the chance of queries being
> > cancelled in various circumstances
> 
> Even HOT-prune may lead to frequent query cancellations and unlike
> VACUUM there is no way user can control the frequency of prune
> operations.

The patch does go to some trouble to handle that case, as I'm sure
you've seen. Are you saying that part of the patch is ineffective and
should be removed, or?

Should/could there be a way to control frequency of prune operations? We
could maintain cleanupxmin as a constant minimum distance from xmax, for
example.

Are we saying we should take further measures, as I asked upthread? If
it is a consensus that I take some action, then I will.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Hot standby and b-tree killed items

From
"Pavan Deolasee"
Date:
On Wed, Dec 24, 2008 at 5:26 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>

>
> The patch does go to some trouble to handle that case, as I'm sure
> you've seen. Are you saying that part of the patch is ineffective and
> should be removed, or?
>

Umm.. are you talking about the "wait" mechanism ? That's the only
thing I remember. Otherwise, prune record is pretty much same as any
vacuum cleanup record.

> Should/could there be a way to control frequency of prune operations? We
> could maintain cleanupxmin as a constant minimum distance from xmax, for
> example.
>

Well, there can be. But tuning any such thing might be difficult and
would have implications on the primary. I am not saying we can do
that, but we will need additional tests to see its impact.

> Are we saying we should take further measures, as I asked upthread? If
> it is a consensus that I take some action, then I will.
>

Again, I haven't seen how frequently queries may get canceled. Or if
the delay is set to a large value, how far behind standby may get
during replication, so I can't really comment. Have you done any tests
on a reasonable hardware and checked if moderately long read queries
can be run on standby without standby lagging behind a lot.

I would prefer to have a solution which can be smarter than canceling
all queries as soon as a cleanup record comes and timeout occurs. For
example, if the queries are being run on a completely different set of
tables where as the updates/deletes are happening on another set of
tables, there is no reason why those queries should be canceled. I
think it would be very common to have large history tables which may
receive long read-only queries, but no updates/deletes. Whereas other
frequently updated tables which receive very few queries on the
standby.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com


Re: Hot standby and b-tree killed items

From
Simon Riggs
Date:
On Wed, 2008-12-24 at 17:56 +0530, Pavan Deolasee wrote:
> On Wed, Dec 24, 2008 at 5:26 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> >
> 
> >
> > The patch does go to some trouble to handle that case, as I'm sure
> > you've seen. Are you saying that part of the patch is ineffective and
> > should be removed, or?
> >
> 
> Umm.. are you talking about the "wait" mechanism ? That's the only
> thing I remember. Otherwise, prune record is pretty much same as any
> vacuum cleanup record.

With respect, I was hoping you might look in the patch and see if you
agree with the way it is handled. No need to remember. The whole
latestRemovedXid concept is designed to do help.

> > Should/could there be a way to control frequency of prune operations? We
> > could maintain cleanupxmin as a constant minimum distance from xmax, for
> > example.
> >
> 
> Well, there can be. But tuning any such thing might be difficult and
> would have implications on the primary. I am not saying we can do
> that, but we will need additional tests to see its impact.
> 
> > Are we saying we should take further measures, as I asked upthread? If
> > it is a consensus that I take some action, then I will.
> >
> 
> Again, I haven't seen how frequently queries may get canceled. Or if
> the delay is set to a large value, how far behind standby may get
> during replication, so I can't really comment. Have you done any tests
> on a reasonable hardware and checked if moderately long read queries
> can be run on standby without standby lagging behind a lot.

Queries get cancelled if data they need to see if removed and the
max_standby_delay expires. So lag will be max_standby_delay, by
definition.

Not sure what further tests would show. Queries that run for longer than
max_standby delay plus mean time between cleanup records will currently
end up being cancelled.

> I would prefer to have a solution which can be smarter than canceling
> all queries as soon as a cleanup record comes and timeout occurs. 

Currently, it was the consensus view that queries should be cancelled,
though there are other options still on the table.

It's discussed in Design Notes on the Wiki. "Simply ignoring WAL removal
has been discussed and rejected (so far).
http://archives.postgresql.org/pgsql-hackers/2008-05/msg00753.php
Explicitly defining the tables a transaction wishes to see has also been
discussed and rejected (so far).
http://archives.postgresql.org/pgsql-hackers/2008-08/msg00268.php"

> For
> example, if the queries are being run on a completely different set of
> tables where as the updates/deletes are happening on another set of
> tables, there is no reason why those queries should be canceled. I
> think it would be very common to have large history tables which may
> receive long read-only queries, but no updates/deletes. Whereas other
> frequently updated tables which receive very few queries on the
> standby.

There is currently no way to tell which tables a query will touch during
the course of its execution. Nor is there likely to be because of
user-defined volatile functions.

I attempted to find ways to explicitly limit the set of tables over
which a query might venture, but that cam to nothing also.

We've also discussed storing lastCleanedLSN for each buffer, so queries
can cancel themselves if they need to read a buffer that has had data
removed from it that they would have needed to see. I'll write that up
also.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Hot standby and b-tree killed items

From
Robert Treat
Date:
On Wednesday 24 December 2008 08:48:04 Simon Riggs wrote:
> On Wed, 2008-12-24 at 17:56 +0530, Pavan Deolasee wrote:
> > Again, I haven't seen how frequently queries may get canceled. Or if
> > the delay is set to a large value, how far behind standby may get
> > during replication, so I can't really comment. Have you done any tests
> > on a reasonable hardware and checked if moderately long read queries
> > can be run on standby without standby lagging behind a lot.
>
> Queries get cancelled if data they need to see if removed and the
> max_standby_delay expires. So lag will be max_standby_delay, by
> definition.
>
> Not sure what further tests would show. Queries that run for longer than
> max_standby delay plus mean time between cleanup records will currently
> end up being cancelled.
>

I think the uncertainty comes from peoples experience with typical replication 
use cases vs a lack of experience with this current implementation.  

One such example is that it is pretty common to use read-only slaves to do 
horizontal scaling of read queries across a bunch of slaves. This is not the 
scenario of running reporting queries on a second machine to lower load; you 
would be running a large number of read-only, relativly short, oltp-ish 
queries (think pg_benchs select only test i suppose), but you also have a 
fairly regular stream of inserts/updates going on with these same tables, its 
just you have 95/5 split of read/write (or similar). 

This is standard practice in things like mysql or using slony or what have 
you. I suspect it's one of the first things people are going to want to do 
with hot standby. But it's unclear how well this will work because we don't 
have any experience with it yet, coupled with the two downsides being 
mentioned as canceled queries and replay lag, which happen to be probably the 
two worst downsides you would have in the above scenario. :-)

Hmm.... I'm not sure why I didn't think of running this test before, but  
read/write pg_bench on a master with pg_bench select test on slave isn't that 
bad of a scenario to match the above; it might be a little too much activity 
on the master, but has anyone else run such a test? 

-- 
Robert Treat
Conjecture: http://www.xzilla.net
Consulting: http://www.omniti.com


Re: Hot standby and b-tree killed items

From
Simon Riggs
Date:
On Wed, 2008-12-24 at 09:59 -0500, Robert Treat wrote:

> I think the uncertainty comes from peoples experience with typical replication 
> use cases vs a lack of experience with this current implementation.  

Quite possibly.

Publishing user feedback on this will be very important in making this a
usable feature.

I'd be very happy if you were to direct the search for optimal
usability.

> One such example is that it is pretty common to use read-only slaves to do 
> horizontal scaling of read queries across a bunch of slaves. This is not the 
> scenario of running reporting queries on a second machine to lower load; you 
> would be running a large number of read-only, relativly short, oltp-ish 
> queries (think pg_benchs select only test i suppose), but you also have a 
> fairly regular stream of inserts/updates going on with these same tables, its 
> just you have 95/5 split of read/write (or similar). 

One thing to consider also is latency of information. Sending queries to
master or slave may return different answers if querying very recent
data.

> This is standard practice in things like mysql or using slony or what have 
> you. I suspect it's one of the first things people are going to want to do 
> with hot standby. But it's unclear how well this will work because we don't 
> have any experience with it yet, coupled with the two downsides being 
> mentioned as canceled queries and replay lag, which happen to be probably the 
> two worst downsides you would have in the above scenario. :-)
> 
> Hmm.... I'm not sure why I didn't think of running this test before, but  
> read/write pg_bench on a master with pg_bench select test on slave isn't that 
> bad of a scenario to match the above; it might be a little too much activity 
> on the master, but has anyone else run such a test? 
> 
-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Hot standby and b-tree killed items

From
"Pavan Deolasee"
Date:
On Wed, Dec 24, 2008 at 7:18 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
>
>
> With respect, I was hoping you might look in the patch and see if you
> agree with the way it is handled. No need to remember. The whole
> latestRemovedXid concept is designed to do help.
>

Well, that's common for all cleanup record including vacuum. But
reading your comment, it seemed as there is something special to
handle HOT prune case which I did not see. Anyways, the trouble with
HOT prune is that uples may be cleaned up very frequently and that can
lead to query cancellation at the standby. That's what I wanted to
emphasize.


>
> Queries get cancelled if data they need to see if removed and the
> max_standby_delay expires. So lag will be max_standby_delay, by
> definition.

That's per cleanup record, isn't it ?


> We've also discussed storing lastCleanedLSN for each buffer, so queries
> can cancel themselves if they need to read a buffer that has had data
> removed from it that they would have needed to see. I'll write that up
> also.
>

What if we do that at table level ? So if a query touches a table
which had cleanup activity since the query was started, it cancels
itself automatically,

Happy X'mas to all of you!

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com


Re: Hot standby and b-tree killed items

From
"marcin mank"
Date:
> Perhaps we should listen to the people that have said they don't want
> queries cancelled, even if the alternative is inconsistent answers.

I think an alternative to that would be "if the wal backlog is too
big, let current queries finish and let incoming queries wait till the
backlog gets smaller".

fell free to ignore me, as a non-hacker I`m not even supposed to be
reading this list :-]

Greetings
Marcin


Re: Hot standby and b-tree killed items

From
Heikki Linnakangas
Date:
marcin mank wrote:
>> Perhaps we should listen to the people that have said they don't want
>> queries cancelled, even if the alternative is inconsistent answers.

I don't like that much. PostgreSQL has traditionally avoided that very 
hard. It's hard to tell what kind of inconsistencies you'd get, as it'd 
depend on what plan is created, when a vacuum happens to run on master etc.

> I think an alternative to that would be "if the wal backlog is too
> big, let current queries finish and let incoming queries wait till the
> backlog gets smaller".

Yeah, that makes sense too.

Many approaches have been proposed, and they all have different 
tradeoffs and therefore fit different use cases. I'm not sure which ones 
are/will be included in the patch. We don't need all in 8.4, one or two 
simplest ones will do just fine, and we can extend later.

Let me summarize. Whenever a WAL record conflicts with a 
query-in-progress, we can:

1. kill the query, or
2. wait for the query to finish
3. let the query proceed, producing invalid results.

There's some combinations of those as well. You're proposal is a 
variation of 2, to avoid the problem of WAL application falling behind 
indefinitely. There's also the max_standby_delay option in the patch, to 
wait a while, and then kill the query.

There's some additional optimizations that can be made to make those 
options less painful. Instead of killing all queries that might be 
affected by a vacuum record, only kill them when they actually hit a 
block that was vacuumed (Simon's idea of latestRemovedLSN field in page 
header).

Another line of attack is to avoid getting into the situation in the 
first place, by affecting behavior on the master. If the standby has an 
online connection to the master (per the synch rep patch), it can tell 
master what the slave's OldestXmin is, and master can take that into 
account and not remove tuples still needed by the slave. That's not good 
from high availability point of view, you don't want a hung query in the 
slave to cause a long-running-transaction situation in the master, but 
for other use cases it would be fine. Or we can just add a constant # of 
transactions to OldestXmin in master, to get some breathing room in the 
server.

The bottom line is that we have enough options to make everyone happy. 
Some understanding of the issue is required to tune it properly, 
however, so documentation is important.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby and b-tree killed items

From
Simon Riggs
Date:
On Fri, 2008-12-19 at 09:22 -0500, Greg Stark wrote:

> I'm confused shouldn't read-only transactions on the slave just be  
> hacked to not set any hint bits including lp_delete?

It seems there are multiple issues involved and I saw only the first of
these initially. I want to explicitly separate these issues so we can
discuss them more easily.

1. When we replay an XLOG_BTREE_DELETE record, we may have to
wait-then-cancel-etc other sessions.

Possibly a pain, but these records are not very common now that we have
HOT, except on certain kinds of queue table.

2. Should we ignore the LP_DEAD flag on btree rows when we are using the
index during recovery? As Heikki points out, this hint bit is not WAL
logged, but can appear in the standby as a result of full page writes.
The LP_DEAD flags will have been set using a different xmin to the one
on the standby and would cause index rows to be ignored that should have
been included in a correct MVCC answer.

So we need to either

(a) always ignore LP_DEAD flags we see when reading index during
recovery.

(b) include an additional step to clean the full page writes to remove
LP_DEAD hints from the incoming pages.

(b) is feasible, but would need to be repeated each time a new full page
arrived, so a page may need to be re-cleaned many times. Sounds like a
bad plan, so we should choose (a).

3. Should we set LP_DELETE flag on btree rows when we are using the
index during recovery? Not much point if we are ignoring them.

There is no space for an additional flag, to distinguish between primary
and standby hint bits.


Issues (2) and (3) would go away entirely if both standby and primary
always had the same xmin value as a system-wide setting. i.e. the
standby and primary are locked together at their xmins. Perhaps that was
Heikki's intention in recent suggestions? 

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Hot standby and b-tree killed items

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> Issues (2) and (3) would go away entirely if both standby and primary
> always had the same xmin value as a system-wide setting. i.e. the
> standby and primary are locked together at their xmins. Perhaps that was
> Heikki's intention in recent suggestions? 

No, I only suggested that as an optional optimization. We can't rely on 
it, because the queries on standby should still work correctly if the 
connection to primary is lost for some reason, or if the primary decides 
not to honor standby's xmin, perhaps to avoid the usual issues with 
long-running-transactions.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby and b-tree killed items

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> (a) always ignore LP_DEAD flags we see when reading index during
> recovery.

This sounds simplest, and it's nice to not clear the flags for the 
benefit of transactions running after the recovery is done.

You have to be careful to ignore the flags in read-only transactions 
that started in hot standby mode, even if recovery has since ended and 
we're in normal operation now.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby and b-tree killed items

From
Simon Riggs
Date:
On Tue, 2008-12-30 at 18:31 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > (a) always ignore LP_DEAD flags we see when reading index during
> > recovery.
> 
> This sounds simplest, and it's nice to not clear the flags for the 
> benefit of transactions running after the recovery is done.

Agreed.

(Also: Transaction hint bits are always set correctly, because we would
only ever see a full page write with hints set after the commit/abort
record was processed. So I continue to honour transaction hint bit
reading and setting during recovery).

> You have to be careful to ignore the flags in read-only transactions 
> that started in hot standby mode, even if recovery has since ended and 
> we're in normal operation now.

Got that.

I'm setting ignore_killed_tuples = false at the start of any index scan
during recovery. And kill_prior_tuples is never set true when in
recovery. Both measures are AM-agnostic.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Hot standby and b-tree killed items

From
Simon Riggs
Date:
On Tue, 2008-12-30 at 18:31 +0200, Heikki Linnakangas wrote:

> You have to be careful to ignore the flags in read-only transactions 
> that started in hot standby mode, even if recovery has since ended and 
> we're in normal operation now.

My initial implementation in v6 worked, but had a corner case if a
transaction spanned the change from recovery into normal processing.

I'm now done on a more complete fix, will be in v8.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support