Thread: Summary and Plan for Hot Standby

Summary and Plan for Hot Standby

From
Simon Riggs
Date:
After some time thinking about the best way forward for Hot Standby, I
have some observations and proposals.

First, the project is very large. We have agreed ways to trim the patch,
yet it remains large. Trying to do everything in one lump is almost
always a bad plan, so we need to phase things.

Second, everybody is keen that HS hits the tree, so we can have alpha
code etc. There are a few remaining issues that should *not* be rushed.
The only way to remove this dependency is to decouple parts of the
project.

Third, testing the patch is difficult and continuous change makes it
harder to guarantee everything is working.

There are two remaining areas of significant thought/effort:

* Issues relating to handling of prepared transactions
* How fast Hot Standby mode is enabled in the standby

I propose that we stabilise and eventually commit a version of HS that
circumvents/defers those issues and then address the issues with
separate patches afterwards. This approach will allow us to isolate the
areas of further change so we can have a test blitz to remove silly
mistakes, then follow it with a commit to CVS, and then release as Alpha
to allow further testing.

Let's look at the two areas of difficulty in more detail:

* Issues relating to handling of prepared transactions
There are some delicate issues surrounding what happens at the end of
recovery if there is a prepared transaction still holding an access
exclusive lock. It is straightforward to say, as an interim measure,
"Hot Standby will not work with max_prepared_transactions > 0". I see
that this has a fiddly, yet fairly clear solution.

* How fast Hot Standby mode is enabled in the standby
We need to have full snapshot information on the standby before we can
allow connections and queries. There are two basic approaches: i) we
wait until we *know* we have full info or ii) we try to collect data and
inject a correct starting condition. Waiting (i) may take a while, but
is clean and requires only a few lines of code. Injecting the starting
condition (ii) requires boatloads of hectic code and we have been unable
to agree a way forwards. If we did have that code, all it would give us
is a faster/more reliable starting point for connections on the standby.
Until we can make approach (ii) work, we should just rely on the easy
approach (i). In many cases, the starting point is very similar. (In
some cases we can actually make (i) faster because the overhead of data
collection forces us to derive the starting conditions minutes apart.)

Phasing the commit seems like the only way.

Please can we agree a way forwards?

-- Simon Riggs           www.2ndQuadrant.com



Re: Summary and Plan for Hot Standby

From
Magnus Hagander
Date:
On Sun, Nov 15, 2009 at 09:06, Simon Riggs <simon@2ndquadrant.com> wrote:
> * Issues relating to handling of prepared transactions
> There are some delicate issues surrounding what happens at the end of
> recovery if there is a prepared transaction still holding an access
> exclusive lock. It is straightforward to say, as an interim measure,
> "Hot Standby will not work with max_prepared_transactions > 0". I see
> that this has a fiddly, yet fairly clear solution.

If that restriction will solve issues we have now, I find that a
perfectly reasonable restriction. Even if it were to still be there
past release, and only get fixed in a future release. The vast
majority of our users don't use 2PC at all. Most cases where people
had it enabled were because it used to be enabled by default, and in the
large majority of cases where I've seen people increase it, they
actually thought it meant prepared statements, not prepared
transactions.

So definitely +1.


> * How fast Hot Standby mode is enabled in the standby
> We need to have full snapshot information on the standby before we can
> allow connections and queries. There are two basic approaches: i) we
> wait until we *know* we have full info or ii) we try to collect data and
> inject a correct starting condition. Waiting (i) may take a while, but
> is clean and requires only a few lines of code. Injecting the starting
> condition (ii) requires boatloads of hectic code and we have been unable
> to agree a way forwards. If we did have that code, all it would give us
> is a faster/more reliable starting point for connections on the standby.
> Until we can make approach (ii) work, we should just rely on the easy
> approach (i). In many cases, the starting point is very similar. (In
> some cases we can actually make (i) faster because the overhead of data
> collection forces us to derive the starting conditions minutes apart.)

That also seems perfectly reasonable, depending on how long the
waiting on (i) will be :-) What does the time depend on?


> Phasing the commit seems like the only way.

Yeah, we usually recommend that in other cases, so I don't see why it
shouldn't apply to HS. Seems like a good way forward.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/


Re: Summary and Plan for Hot Standby

From
Simon Riggs
Date:
On Sun, 2009-11-15 at 10:00 +0100, Magnus Hagander wrote:

> What does the time depend on?

We need to wait for all current transactions to complete (i.e. any
backend that has, or could, take an xid or an AccessExclusiveLock before
it commits). Similar-ish to the wait for CREATE INDEX CONCURRENTLY.

The standby already performs this wait in the case where the snapshot
overflows, i.e. when there are >64 subtransactions in *any* current
transaction on the master. The reason for that is (again) performance on
the master: we choose not to WAL-log new subtransactions.
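
Roughly, the decision on the standby looks like this (a hand sketch with
illustrative names, not the patch's actual code):

    #include <stdbool.h>

    typedef struct
    {
        bool    subxid_overflow;    /* did any subxid cache overflow? */
    } RunningXactsSketch;           /* illustrative, not the patch's type */

    static bool
    snapshot_is_complete(const RunningXactsSketch *running)
    {
        /*
         * If no subxid cache overflowed on the master, the running-xacts
         * record fully describes its state and we can enable connections
         * immediately.  Otherwise we must wait out every transaction the
         * record lists, much as CREATE INDEX CONCURRENTLY waits out
         * existing transactions.
         */
        return !running->subxid_overflow;
    }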

There are various ways around this and I'm certain we'll come up with
something ingenious but my main point is that we don't need to wait for
this issue to be solved in order for HS to be usable.

-- Simon Riggs           www.2ndQuadrant.com



Re: Summary and Plan for Hot Standby

From
Magnus Hagander
Date:
On Sunday, November 15, 2009, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Sun, 2009-11-15 at 10:00 +0100, Magnus Hagander wrote:
>
>> What does the time depend on?
>
> We need to wait for all current transactions to complete (i.e. any
> backend that has, or could, take an xid or an AccessExclusiveLock before
> it commits). Similar-ish to the wait for CREATE INDEX CONCURRENTLY.
>
> The standby already performs this wait in the case where the snapshot
> overflows, i.e. when there are >64 subtransactions in *any* current
> transaction on the master. The reason for that is (again) performance on
> the master: we choose not to WAL-log new subtransactions.
>
> There are various ways around this and I'm certain we'll come up with
> something ingenious but my main point is that we don't need to wait for
> this issue to be solved in order for HS to be usable.


Yeah, with that explanation (thanks for clearing it up) I agree - it
will definitely still be hugely useful even with this restriction, so
we really don't need to delay an initial (or the alpha at least)
commit.

Thus, +1 on the second one as well :)


--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/


Re: Summary and Plan for Hot Standby

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> We need to wait for all current transactions to complete (i.e. any
> backend that has, or could, take an xid or an AccessExclusiveLock before
> it commits). Similar-ish to the wait for CREATE INDEX CONCURRENTLY.
>
> The standby already performs this wait in the case where the snapshot
> overflows, i.e. when there are >64 subtransactions in *any* current
> transaction on the master. The reason for that is (again) performance on
> the master: we choose not to WAL-log new subtransactions.

WAL-logging every new subtransaction wouldn't actually help. The problem
with subtransactions is that if the subxid cache overflows in the proc
array in the master, the information about the parent-child
relationship is only stored in pg_subtrans, not in the proc array. So
when we take the running-xacts snapshot, we can't include that
information, because there's no easy and fast way to scan pg_subtrans for
it. Because that information is not included in the snapshot, the standby
doesn't have all the information it needs until after it has seen that
all the transactions that had an overflowed xid cache have finished.
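
For reference, the per-backend subxid cache is roughly this (simplified
from src/include/storage/proc.h; comments mine):

    #define PGPROC_MAX_CACHED_SUBXIDS 64    /* per-backend cache size */

    /*
     * Once a backend has more than 64 subtransactions in flight,
     * "overflowed" is set and the extra parent-child mappings exist
     * only in pg_subtrans.
     */
    struct XidCache
    {
        bool            overflowed;
        int             nxids;
        TransactionId   xids[PGPROC_MAX_CACHED_SUBXIDS];
    };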

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Summary and Plan for Hot Standby

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> * Issues relating to handling of prepared transactions
> There are some delicate issues surrounding what happens at the end of
> recovery if there is a prepared transaction still holding an access
> exclusive lock.

Can you describe in more detail what problem this is again? We had
various problems with prepared transactions, but I believe what's in the
git repository now handles all those cases (although I just noticed and
fixed a bug in it, so it's not very well tested or reviewed yet).

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Summary and Plan for Hot Standby

From
Robert Haas
Date:
On Sun, Nov 15, 2009 at 3:06 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Please can we agree a way forwards?

I don't have a strong position on the technical issues, but I am very
much in favor of getting something committed, even something with
limitations, as soon as we practically can.  Getting this feature into
the tree will get a lot more eyeballs on it, and it's much better to
do that now, while we still have several months remaining before beta,
so those eyeballs can be looking at it for longer - and testing it as
part of the regular alpha release process.  It will also eliminate the
need to repeatedly merge with the main tree, etc.

...Robert


Re: Summary and Plan for Hot Standby

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> There are two remaining areas of significant thought/effort:

Here's a list of other TODO items I've collected so far. Some of them
are just improvements or nice-to-have stuff, but some are more serious:

- If WAL recovery runs out of lock space while acquiring an
AccessExclusiveLock on behalf of a transaction that ran in the master,
it will FATAL and abort recovery, bringing down the standby. Seems like
it should wait/cancel queries instead.

- When switching from standby mode to normal operation, we momentarily
hold all AccessExclusiveLocks held by prepared xacts twice, needing
twice the lock space. You can run out of lock space at that point,
causing failover to fail.

- When replaying b-tree deletions, we currently wait out/cancel all
running (read-only) transactions. We take the ultra-conservative stance
because we don't know how recent the tuples being deleted are. If we
could store a better estimate for latestRemovedXid in the WAL record, we
could make that less conservative.

- The assumption that b-tree vacuum records don't need conflict
resolution because we did that with the additional cleanup-info record
works ATM, but it hinges on the fact that we don't delete any tuples
marked as killed while we do the vacuum. That seems like a low-hanging
fruit that I'd actually like to do now that I spotted it, but will then
need to fix b-tree vacuum records accordingly. We'd probably need to do
something about the previous item first to keep performance acceptable.

- There's the optimization to replay of b-tree vacuum records that we
discussed earlier: Replay has to touch all leaf pages because of the
interlock between heap scans, to ensure that we don't vacuum away a heap
tuple that a concurrent index scan is about to visit. Instead of
actually reading in and pinning all pages, during replay we could just
check that the pages that don't need any other work to be done are not
currently pinned in the buffer cache.

- Do we do the b-tree page pinning explained in previous point correctly
at the end of index vacuum? ISTM we're not visiting any pages after the
last page that had dead tuples on it.

- code structure. I moved much of the added code to a new standby.c
module that now takes care of replaying standby-related WAL records. But
there's code elsewhere too. I'm not sure if this is a good division, but
it seems better than the original ad hoc arrangement where e.g.
lock-related WAL handling was in inval.c.

- The "standby delay" is measured as current timestamp - timestamp of
last replayed commit record. If there's little activity in the master,
that can lead to surprising results. For example, imagine that
max_standby_delay is set to 8 hours. The standby is fully up-to-date
with the master, and there's no write activity in master.  After 10
hours, a long reporting query is started in the standby. Ten minutes
later, a small transaction is executed in the master that conflicts with
the reporting query. I would expect the reporting query to be canceled 8
hours after the conflicting transaction began, but it is in fact
canceled immediately, because it's over 8 hours since the last commit
record was replayed. (A rough sketch of this calculation follows this
list.)

- ResolveRecoveryConflictWithVirtualXIDs polls until the victim
transactions have ended. It would be much nicer to sleep. We'd need a
version of LockAcquire with a timeout. Hmm, IIRC someone submitted a
patch for lock timeouts recently. Maybe we could borrow code from that?
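
To make the max_standby_delay point concrete, the logic is roughly this
(a hand sketch, not the actual code; TimestampDifference and
GetCurrentTimestamp are existing backend helpers, the other names are
illustrative):

    static bool
    standby_delay_exceeded(TimestampTz last_replayed_commit_ts,
                           int max_standby_delay_secs)
    {
        long    secs;
        int     usecs;

        TimestampDifference(last_replayed_commit_ts,
                            GetCurrentTimestamp(),
                            &secs, &usecs);

        /*
         * With an idle master, last_replayed_commit_ts can be hours old,
         * so this fires immediately for any new conflict.
         */
        return secs >= max_standby_delay_secs;
    }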


--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com




Re: Summary and Plan for Hot Standby

From
Simon Riggs
Date:
On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > There are two remaining areas of significant thought/effort:
> 
> Here's a list of other TODO items I've collected so far. Some of them
> are just improvements or nice-to-have stuff, but some are more serious:
> 
> - If WAL recovery runs out of lock space while acquiring an
> AccessExclusiveLock on behalf of a transaction that ran in the master,
> it will FATAL and abort recovery, bringing down the standby. Seems like
> it should wait/cancel queries instead.

Hard resources will always be an issue. If the standby has less than it
needs, then there will be problems. All of those can be corrected by
increasing the resources on the standby and restarting. This affects
max_connections, max_prepared_transactions, max_locks_per_transaction,
as documented.

> - When switching from standby mode to normal operation, we momentarily
> hold all AccessExclusiveLocks held by prepared xacts twice, needing
> twice the lock space. You can run out of lock space at that point,
> causing failover to fail.

That was the issue I mentioned.

> - When replaying b-tree deletions, we currently wait out/cancel all
> running (read-only) transactions. We take the ultra-conservative stance
> because we don't know how recent the tuples being deleted are. If we
> could store a better estimate for latestRemovedXid in the WAL record, we
> could make that less conservative.

Exactly my point. There are already parts of the patch that may cause
usage problems and need further thought. The earlier we get this to
people the earlier we will find out what they all are and begin doing
something about them.

> - The assumption that b-tree vacuum records don't need conflict
> resolution because we did that with the additional cleanup-info record
> works ATM, but it hinges on the fact that we don't delete any tuples
> marked as killed while we do the vacuum. That seems like a low-hanging
> fruit that I'd actually like to do now that I spotted it, but will then
> need to fix b-tree vacuum records accordingly. We'd probably need to do
> something about the previous item first to keep performance acceptable.
> 
> - There's the optimization to replay of b-tree vacuum records that we
> discussed earlier: Replay has to touch all leaf pages because of the
> interlock between heap scans, to ensure that we don't vacuum away a heap
> tuple that a concurrent index scan is about to visit. Instead of
> actually reading in and pinning all pages, during replay we could just
> check that the pages that don't need any other work to be done are not
> currently pinned in the buffer cache.

Yes, it's an optimization. Not one I consider critical, yet cool and
interesting.

> - Do we do the b-tree page pinning explained in previous point correctly
> at the end of index vacuum? ISTM we're not visiting any pages after the
> last page that had dead tuples on it.

Looks like a new bug, not previously mentioned.

> - code structure. I moved much of the added code to a new standby.c
> module that now takes care of replaying standby-related WAL records. But
> there's code elsewhere too. I'm not sure if this is a good division, but
> it seems better than the original ad hoc arrangement where e.g.
> lock-related WAL handling was in inval.c.

> - The "standby delay" is measured as current timestamp - timestamp of
> last replayed commit record. If there's little activity in the master,
> that can lead to surprising results. For example, imagine that
> max_standby_delay is set to 8 hours. The standby is fully up-to-date
> with the master, and there's no write activity in master.  After 10
> hours, a long reporting query is started in the standby. Ten minutes
> later, a small transaction is executed in the master that conflicts with
> the reporting query. I would expect the reporting query to be canceled 8
> hours after the conflicting transaction began, but it is in fact
> canceled immediately, because it's over 8 hours since the last commit
> record was replayed.

An issue that will be easily fixable with streaming, since it
effectively needs a heartbeat to listen to. Adding a regular stream of
WAL records is also possible, but there is no need, unless streaming is
somehow in doubt. Again, there is work to do once both are in.

> - ResolveRecoveryConflictWithVirtualXIDs polls until the victim
> transactions have ended. It would be much nicer to sleep. We'd need a
> version of LockAcquire with a timeout. Hmm, IIRC someone submitted a
> patch for lock timeouts recently. Maybe we could borrow code from that?

Nice? 

-- Simon Riggs           www.2ndQuadrant.com



Re: Summary and Plan for Hot Standby

From
Greg Stark
Date:
On Sun, Nov 15, 2009 at 2:32 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> - The "standby delay" is measured as current timestamp - timestamp of
>> last replayed commit record. If there's little activity in the master,
>> that can lead to surprising results. For example, imagine that
>> max_standby_delay is set to 8 hours. The standby is fully up-to-date
>> with the master, and there's no write activity in master.  After 10
>> hours, a long reporting query is started in the standby. Ten minutes
>> later, a small transaction is executed in the master that conflicts with
>> the reporting query. I would expect the reporting query to be canceled 8
>> hours after the conflicting transaction began, but it is in fact
>> canceled immediately, because it's over 8 hours since the last commit
>> record was replayed.
>
> An issue that will be easily fixable with streaming, since it
> effectively needs a heartbeat to listen to. Adding a regular stream of
> WAL records is also possible, but there is no need, unless streaming is
> somehow in doubt. Again, there is work to do once both are in.

I don't think you need a heartbeat to solve this particular case. You
just need to define the "standby delay" to be "current timestamp -
timestamp of the conflicting candidate commit record".


--
greg


Re: Summary and Plan for Hot Standby

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote:
>> - If WAL recovery runs out of lock space while acquiring an
>> AccessExclusiveLock on behalf of a transaction that ran in the master,
>> it will FATAL and abort recovery, bringing down the standby. Seems like
>> it should wait/cancel queries instead.
> 
> Hard resources will always be an issue. If the standby has less than it
> needs, then there will be problems. All of those can be corrected by
>> increasing the resources on the standby and restarting. This affects
> max_connections, max_prepared_transactions, max_locks_per_transaction,
> as documented.

There's no safe setting for those that would let you avoid the issue. No
matter how high you set them, it will be possible for read-only backends
to consume all the lock space, causing recovery to abort and bring down
the standby.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Summary and Plan for Hot Standby

From
Simon Riggs
Date:
On Sun, 2009-11-15 at 14:47 +0000, Greg Stark wrote:
> On Sun, Nov 15, 2009 at 2:32 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> >> - The "standby delay" is measured as current timestamp - timestamp of
> >> last replayed commit record. If there's little activity in the master,
> >> that can lead to surprising results. For example, imagine that
> >> max_standby_delay is set to 8 hours. The standby is fully up-to-date
> >> with the master, and there's no write activity in master.  After 10
> >> hours, a long reporting query is started in the standby. Ten minutes
> >> later, a small transaction is executed in the master that conflicts with
> >> the reporting query. I would expect the reporting query to be canceled 8
> >> hours after the conflicting transaction began, but it is in fact
> >> canceled immediately, because it's over 8 hours since the last commit
> >> record was replayed.
> >
> > An issue that will be easily fixable with streaming, since it
> > effectively needs a heartbeat to listen to. Adding a regular stream of
> > WAL records is also possible, but there is no need, unless streaming is
> > somehow in doubt. Again, there is work to do once both are in.
> 
> I don't think you need a heartbeat to solve this particular case. You
> just need to define the "standby delay" to be "current timestamp -
> timestamp of the conflicting candidate commit record".

That's not possible unfortunately.

We only have times for commits and aborts. So there could be untimed WAL
records ahead of the last timed record.

The times of events we know from the log records give us no clue as to
when the last non-commit/abort record arrived. We can only learn that by

(i) specifically augmenting the log with regular, timed WAL records, or
(ii) asking WALreceiver directly when it last spoke with the master.

(ii) is the obvious way to do this when we have streaming replication,
and HS assumes this will be available. If it isn't, we can do (i).

Heikki's case is close to one I would expect to see in many cases: a
database that is only active during the day feeds a system that runs
queries 24x7. Run a VACUUM on the master at night and you could get
conflicts that follow the pattern described.

-- Simon Riggs           www.2ndQuadrant.com



Re: Summary and Plan for Hot Standby

From
Simon Riggs
Date:
On Sun, 2009-11-15 at 16:50 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote:
> >> - If WAL recovery runs out of lock space while acquiring an
> >> AccessExclusiveLock on behalf of a transaction that ran in the master,
> >> it will FATAL and abort recovery, bringing down the standby. Seems like
> >> it should wait/cancel queries instead.
> > 
> > Hard resources will always be an issue. If the standby has less than it
> > needs, then there will be problems. All of those can be corrected by
> > increasing the resources on the standby and restarting. This affects
> > max_connections, max_prepared_transactions, max_locks_per_transaction,
> > as documented.
> 
> There's no safe setting for those that would let you avoid the issue. No
> matter how high you set them, it will be possible for read-only backends
> to consume all the lock space, causing recovery to abort and bring down
> the standby.

It can still fail even after we kick everybody off. So why bother? Most
people run nowhere near the size limit of their lock tables, and on the
standby we only track AccessExclusiveLocks in the Startup process. We
gain little by spending time on partial protection against an unlikely
issue.

(BTW, I'm not suggesting you commit HS immediately. Only that we split
into phases, stabilise and test phase 1 soon, then fix the remaining
issues later.)

-- Simon Riggs           www.2ndQuadrant.com



Re: Summary and Plan for Hot Standby

From
Robert Haas
Date:
On Sun, Nov 15, 2009 at 9:50 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Simon Riggs wrote:
>> On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote:
>>> - If WAL recovery runs out of lock space while acquiring an
>>> AccessExclusiveLock on behalf of a transaction that ran in the master,
>>> it will FATAL and abort recovery, bringing down the standby. Seems like
>>> it should wait/cancel queries instead.
>>
>> Hard resources will always be an issue. If the standby has less than it
>> needs, then there will be problems. All of those can be corrected by
>> increasing the resources on the standby and restarting. This affects
>> max_connections, max_prepared_transactions, max_locks_per_transaction,
>> as documented.
>
> There's no safe setting for those that would let you avoid the issue. No
> matter how high you set them, it will be possible for read-only backends
> to consume all the lock space, causing recovery to abort and bring down
> the standby.

OK, but... there will always be things that will bring down the
stand-by, just as there will always be things that can bring down the
primary.  A bucket of ice-water will probably do it, for example.  I
mean, it would be great to make it better, but is it so bad that we
can't postpone that improvement to a follow-on patch?  It's not clear
to me that it is.  I think we should really focus in on things that
(1) are likely to make this give wrong answers or (2) won't be fixable
in a follow-on patch if they're not right in the original
one.  Only one or two of the items on your list of additional TODOs
seem like they might fall into category (2), and none of them appear
to fall into category (1).

I predict that if we commit a basic version of this with some annoying
limitations for 8.5, people who need the feature will start writing
patches to address some of the limitations.  No one else is going to
undertake any serious development work as long as this remains outside
the main tree, for fear of everything changing under them and all
their work being wasted.  I would like this feature to be as good as
possible, but I would like to have it at all more.

...Robert


Re: Summary and Plan for Hot Standby

From
Heikki Linnakangas
Date:
Robert Haas wrote:
> OK, but... there will always be things that will bring down the
> stand-by, just as there will always be things that can bring down the
> primary.  A bucket of ice-water will probably do it, for example. I
> mean, it would be great to make it better, but is it so bad that we
> can't postpone that improvement to a follow-on patch?

We're not talking about a bucket of ice-water. We're talking about
issuing SELECTs to a lot of different tables in a single transaction.

>  Only one or two of the items on your list of additional TODOs
> seem like they might fall into category (2), and none of them appear
> to fall into category (1).

At least the b-tree vacuum bug could cause incorrect answers, even
though it would be extremely hard to run into it in practice.

> I predict that if we commit a basic version of this with some annoying
> limitations for 8.5, people who need the feature will start writing
> patches to address some of the limitations.  No one else is going to
> undertake any serious development work as long as this remains outside
> the main tree, for fear of everything changing under them and all
> their work being wasted.  I would like this feature to be as good as
> possible, but I would like to have it at all more.

Agreed. Believe me, I'd like to have this committed as much as everyone
else. But once I do that, I'm also committing myself to fix all the
remaining issues before the release. The criteria for committing is: is
it good enough that we could release it tomorrow with no further
changes? Nothing more, nothing less.

We have *already* postponed a lot of nice-to-have stuff like the
functions to control recovery. And yes, many of the things I listed in
the TODO are not must-haves and we could well release without them.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Summary and Plan for Hot Standby

From
Simon Riggs
Date:
On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote:

> The assumption that b-tree vacuum records don't need conflict
> resolution because we did that with the additional cleanup-info record
> works ATM, but it hinges on the fact that we don't delete any tuples
> marked as killed while we do the vacuum. 

> That seems like a low-hanging
> fruit that I'd actually like to do now that I spotted it, but will
> then need to fix b-tree vacuum records accordingly. We'd probably need
> to do something about the previous item first to keep performance
> acceptable.

We can optimise that by using the xlog pointer of the HeapInfo record.
Any blocks cleaned that haven't been further updated can avoid
generating further btree deletion records. If you do this the
straightforward way then it will just generate a stream of btree
deletion records that will ruin usability.

You spotted this issue only this morning??

-- Simon Riggs           www.2ndQuadrant.com



Re: Summary and Plan for Hot Standby

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote:
> 
>> The assumption that b-tree vacuum records don't need conflict
>> resolution because we did that with the additional cleanup-info record
>> works ATM, but it hinges on the fact that we don't delete any tuples
>> marked as killed while we do the vacuum. 
> 
>> That seems like a low-hanging
>> fruit that I'd actually like to do now that I spotted it, but will
>> then need to fix b-tree vacuum records accordingly. We'd probably need
>> to do something about the previous item first to keep performance
>> acceptable.
> 
> We can optimise that by using the xlog pointer of the HeapInfo record.
> Any blocks cleaned that haven't been further updated can avoid
> generating further btree deletion records.

Sorry, I don't understand that. (Remember that marking index tuples as
killed is not WAL-logged.)

> You spotted this issue only this morning??

Yesterday evening.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Summary and Plan for Hot Standby

From
Simon Riggs
Date:
On Sun, 2009-11-15 at 19:36 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote:
> > 
> >> The assumption that b-tree vacuum records don't need conflict
> >> resolution because we did that with the additional cleanup-info record
> >> works ATM, but it hinges on the fact that we don't delete any tuples
> >> marked as killed while we do the vacuum. 
> > 
> >> That seems like a low-hanging
> >> fruit that I'd actually like to do now that I spotted it, but will
> >> then need to fix b-tree vacuum records accordingly. We'd probably need
> >> to do something about the previous item first to keep performance
> >> acceptable.
> > 
> > We can optimise that by using the xlog pointer of the HeapInfo record.
> > Any blocks cleaned that haven't been further updated can avoid
> > generating further btree deletion records.
> 
> Sorry, I don't understand that. (Remember that marking index tuples as
> killed is not WAL-logged.)

Remember that blocks are marked with an LSN? When we insert a WAL record
it has an LSN also. So we can tell which btree blocks might have been
written to after the HeapInfo record is generated. So if a block hasn't
been recently updated, or if it doesn't have any killed tuples, then we
need not generate a record to handle a possible conflict.
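
A sketch of the test I have in mind (illustrative only; cleanup_lsn and
PageHasKilledTuples are hypothetical names, not existing code):

    XLogRecPtr  page_lsn = PageGetLSN(page);    /* last WAL touch of page */

    /*
     * cleanup_lsn is the LSN of the cleanup-info record written before
     * the index vacuum started; we would need to fetch it from somewhere.
     */
    if (XLByteLT(page_lsn, cleanup_lsn) || !PageHasKilledTuples(page))
    {
        /*
         * Page untouched since the cleanup-info record, or nothing
         * killed on it: no extra conflict-bearing deletion record is
         * needed for this block.
         */
        return;
    }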

-- Simon Riggs           www.2ndQuadrant.com



Re: Summary and Plan for Hot Standby

From
Robert Haas
Date:
On Sun, Nov 15, 2009 at 11:49 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Agreed. Believe me, I'd like to have this committed as much as everyone
> else. But once I do that, I'm also committing myself to fix all the
> remaining issues before the release. The criteria for committing is: is
> it good enough that we could release it tomorrow with no further
> changes? Nothing more, nothing less.

I agree with the criteria but I think their application to the present
set of facts is debatable.  If the b-tree vacuum bug can cause
incorrect answers, then it is a bug and we have to fix it.  But a
query getting canceled because it touches a lot of tables sounds more
like a limitation than an outright bug, and I'm not sure you should
feel like you're on the hook for that, especially if the problem can
be mitigated by adjusting settings.  Of course, on the flip side, if
the problem is likely to occur frequently enough to make the whole
system unusable in practice, then maybe it does need to be fixed.  I
don't know.  It's not my place and I don't intend to question your
technical judgment on what does or does not need to be fixed, the
more so since I haven't read or thought deeply about the latest patch.
I'm just throwing it out there.

The other problem is that we have another big patch sitting right
behind this one waiting for your attention as soon as you get this one
off your chest.  I know Simon has said that he feels that the effort
to finish the HS and SR patches for 9/15 was somewhat of an artificial
deadline, but ISTM that with only 3 months remaining until the close
of the final CommitFest for this release, and two major patches to
be merged, we're starting to get tight on time.  Presumably there will be
problems with both patches that are discovered only after committing
them, and we need some time for those to shake out.  If not enough of
that shaking out happens during the regular development cycle, it will
either prolong beta and therefore delay the release, or the release
will be buggy.

All that having been said, the possibility that I'm a pessimistic
worry-wart certainly can't be ruled out.  :-)

...Robert


Re: Summary and Plan for Hot Standby

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Sun, 2009-11-15 at 19:36 +0200, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote:
>>>
>>>> The assumption that b-tree vacuum records don't need conflict
>>>> resolution because we did that with the additional cleanup-info record
>>>> works ATM, but it hinges on the fact that we don't delete any tuples
>>>> marked as killed while we do the vacuum. 
>>>> That seems like a low-hanging
>>>> fruit that I'd actually like to do now that I spotted it, but will
>>>> then need to fix b-tree vacuum records accordingly. We'd probably need
>>>> to do something about the previous item first to keep performance
>>>> acceptable.
>>> We can optimise that by using the xlog pointer of the HeapInfo record.
>>> Any blocks cleaned that haven't been further updated can avoid
>>> generating further btree deletion records.
>> Sorry, I don't understand that. (Remember that marking index tuples as
>> killed is not WAL-logged.)
> 
> Remember that blocks are marked with an LSN? When we insert a WAL record
> it has an LSN also. So we can tell which btree blocks might have had
> been written to after the HeapInfo record is generated. So if a block
> hasn't been recently updated or it doesn't have any killed tuples then
> we need not generate a record to handle a possible conflict.

Hmm, perhaps we're talking about the same thing. What I'm seeing is that
we could easily do this:

*** a/src/backend/access/nbtree/nbtree.c
--- b/src/backend/access/nbtree/nbtree.c
***************
*** 843,855 **** restart:
           offnum <= maxoff;
           offnum = OffsetNumberNext(offnum))
      {
          IndexTuple  itup;
          ItemPointer htup;

!         itup = (IndexTuple) PageGetItem(page,
!                                         PageGetItemId(page, offnum));
          htup = &(itup->t_tid);
!         if (callback(htup, callback_state))
              deletable[ndeletable++] = offnum;
      }
  }
--- 843,856 ----
           offnum <= maxoff;
           offnum = OffsetNumberNext(offnum))
      {
+         ItemId      itemid;
          IndexTuple  itup;
          ItemPointer htup;

!         itemid = PageGetItemId(page, offnum);
!         itup = (IndexTuple) PageGetItem(page, itemid);
          htup = &(itup->t_tid);
!         if (callback(htup, callback_state) || ItemIdIsDead(itemid))
              deletable[ndeletable++] = offnum;
      }
  }
 

But if we do that, b-tree vacuum records are going to need conflict
resolution, just like the b-tree non-vacuum deletion records. The LSN
doesn't help there, because when an itemid is marked as dead, the LSN is
not updated.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Summary and Plan for Hot Standby

From
Heikki Linnakangas
Date:
Robert Haas wrote:
> But a
> query getting canceled because it touches a lot of tables sounds more
> like a limitation than an outright bug, 

It's not that the query might get canceled. It will abort WAL recovery,
kill all backends, and bring the whole standby down.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Summary and Plan for Hot Standby

From
Simon Riggs
Date:
On Sun, 2009-11-15 at 13:15 -0500, Robert Haas wrote:
> I know Simon has said that he feels that the effort
> to finish the HS and SR patches for 9/15 was somewhat of an artificial
> deadline, but ISTM that with only 3 months remaining until the close
> of the final CommitFest for this release, and two major patches to
> merged, we're starting to get tight on time.

Given the further concerns about initial snapshot conditions, I agree we
are now tight on time.

> Presumably there will be
> problems with both patches that are discovered only after committing
> them, and we need some time for those to shake out.  If not enough of
> that shaking out happens during the regular development cycle, it will
> either prolong beta and therefore delay the release, or the release
> will be buggy.

I'm not worried about bugs. Fixes for those can go in anytime.

Missing features and small usability enhancements will be forced to wait
another year and cause upgrades for early adopters. That worries me.
REL8_0 shipped with an unusable bgwriter implementation and I've always
been wary of the need for minor tweaks late in a release since then.

I've not asked for an immediate commit, but we do need an agreed period
of patch stability to allow testing, prior to a commit.

-- Simon Riggs           www.2ndQuadrant.com



Re: Summary and Plan for Hot Standby

From
Simon Riggs
Date:
On Sun, 2009-11-15 at 20:30 +0200, Heikki Linnakangas wrote:

> The LSN doesn't help there, because when an itemid is marked as dead,
> the LSN is not updated.

I was thinking we could update the index block LSN without writing WAL
using the LSN of the heap block that leads to the killed tuple.
Pretending that the block might need flushing won't do much harm. 
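
Sketch (untested; killed_something and heap_lsn are illustrative,
heap_lsn being the LSN of the heap page the caller just followed the
index tuple to):

    if (killed_something)
    {
        /*
         * Carry the heap page's LSN over to the index page without
         * writing any WAL of our own, so a later vacuum can tell that
         * this page was touched "recently".
         */
        if (XLByteLT(PageGetLSN(page), heap_lsn))
            PageSetLSN(page, heap_lsn);
        SetBufferCommitInfoNeedsSave(buf);  /* hint-style dirtying */
    }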

-- Simon Riggs           www.2ndQuadrant.com



Re: Summary and Plan for Hot Standby

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Sun, 2009-11-15 at 20:30 +0200, Heikki Linnakangas wrote:
> 
>> The LSN doesn't help there, because when an itemid is marked as dead,
>> the LSN is not updated.
> 
> I was thinking we could update the index block LSN without writing WAL
> using the LSN of the heap block that leads to the killed tuple.

That can be before the cleanup record we write before we start the index
vacuum.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Summary and Plan for Hot Standby

From
Simon Riggs
Date:
On Sun, 2009-11-15 at 20:37 +0200, Heikki Linnakangas wrote:
> Robert Haas wrote:
> > But a
> > query getting canceled because it touches a lot of tables sounds more
> > like a limitation than an outright bug, 
> 
> It's not that the query might get canceled. It will abort WAL recovery,
> kill all backends, and bring the whole standby down.

Hmm, I think the incredible exploding Hot Standby is overstating this
somewhat. We can improve the error handling for this rare case for which
a simple workaround exists, but it seems like we should punt to phase 2.

You agree there should be two phases?

-- Simon Riggs           www.2ndQuadrant.com



Re: Summary and Plan for Hot Standby

From
Greg Stark
Date:
On Sun, Nov 15, 2009 at 7:29 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> You agree there should be two phases?
>

I don't understand this repeated suggestion of "phases". Nobody's
ever suggested that we would refuse to add new features to HS after
the initial commit or the 8.5 release. Of course there should be later
features if you or anyone else is interested in working on them.

Or are you asking whether we should commit it before it's a usable subset
of the functionality? Personally I am in favour of earlier, more
fine-grained commits but I think the horse has left the stable on that
one. We have a usable subset of the functionality in this patch
already, don't we?

-- 
greg


Re: Summary and Plan for Hot Standby

From
Simon Riggs
Date:
On Sun, 2009-11-15 at 21:20 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Sun, 2009-11-15 at 20:30 +0200, Heikki Linnakangas wrote:
> > 
> >> The LSN doesn't help there, because when an itemid is marked as dead,
> >> the LSN is not updated.
> > 
> > I was thinking we could update the index block LSN without writing WAL
> > using the LSN of the heap block that leads to the killed tuple.
> 
> That can be before the cleanup record we write before we start the index
> vacuum.

Oh well. Strike 1.

But the technique sounds OK, we just need to get the LSN of a HeapInfo
record from somewhere, say, the index metapage. Sounds like we need to do
something similar with the xid.

-- Simon Riggs           www.2ndQuadrant.com



Re: Summary and Plan for Hot Standby

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> You agree there should be two phases?

I'm hesitant to say 'yes', because then you will harass me with "but you
said that you would be OK with fixing X, Y, Z later! Why don't you
commit already!".

Of course there should be several phases! We've *already* punted a lot
of stuff from this first increment we're currently working on. The
criteria for getting this first phase committed is: could we release
with no further changes?

If you actually want to help, can you please focus on fixing the
must-fix bugs we know about? We can then discuss which of the remaining
known issues we're willing to live with.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Summary and Plan for Hot Standby

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Sun, 2009-11-15 at 21:20 +0200, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> On Sun, 2009-11-15 at 20:30 +0200, Heikki Linnakangas wrote:
>>>
>>>> The LSN doesn't help there, because when an itemid is marked as dead,
>>>> the LSN is not updated.
>>> I was thinking we could update the index block LSN without writing WAL
>>> using the LSN of the heap block that leads to the killed tuple.
>> That can be before the cleanup record we write before we start the index
>> vacuum.
> 
> Oh well. Strike 1.
> 
> But the technique sounds OK, we just need to get the LSN of a HeapInfo
> record from somewhere, say, index metapage. Sounds like we need to do
> something similar with the xid.

I'm thinking that we should address the general issue, not just with
vacuum-related deletion records. For the vacuum-related deletion
records, we can just leave the code as it is. I think we talked about
various approaches about a year ago when we first realized that killed
index tuples are a problem, though I don't think we carved out a full
solution.

We could, for example, store the xmax (or xmin if it was inserted by an
aborted transaction) of the killed tuple in the b-tree page header
whenever we mark an index tuple as dead. We could then include that in
the WAL record. The trick is how to make that crash-safe.

(but this whole thing is certainly something we can defer until later)
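
For concreteness, the page-header idea might look like this (sketch only;
btpo_killed_xmax is a hypothetical new field, the rest matches the
existing BTPageOpaqueData in src/include/access/nbtree.h):

    typedef struct BTPageOpaqueData
    {
        BlockNumber btpo_prev;          /* left sibling, or P_NONE */
        BlockNumber btpo_next;          /* right sibling, or P_NONE */
        union
        {
            uint32          level;      /* tree level, zero for leaves */
            TransactionId   xact;       /* next txid, if half-dead */
        }           btpo;
        uint16      btpo_flags;         /* flag bits */
        BTCycleId   btpo_cycleid;       /* vacuum cycle ID of latest split */
        TransactionId btpo_killed_xmax; /* HYPOTHETICAL: newest xid whose
                                         * death is recorded by a killed
                                         * tuple on this page */
    } BTPageOpaqueData;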

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Summary and Plan for Hot Standby

From
Simon Riggs
Date:
On Sun, 2009-11-15 at 21:56 +0200, Heikki Linnakangas wrote:

> If you actually want to help, can you please focus on fixing the
> must-fix bugs we know about? We can then discuss which of the
> remaining known issues we're willing to live with.

I intend to work on all of the issues, so I'm not sure what you mean by
help. When the roles of author and reviewer become blurred it gets
harder to work together, for certain.

Since we are short of time and some issues will take time, the priority
order of further work is important. Right now, I don't know which you
consider to be the must-fix issues, hence the thread. I also don't know
what you consider to be appropriate fixes to them, so unfortunately
there will be more talking until it is time for action. I prefer coding,
just like you.

-- Simon Riggs           www.2ndQuadrant.com



Re: Summary and Plan for Hot Standby

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> Right now, I don't know which you
> consider to be the must-fix issues, hence the thread.

Ok, could you tackle the b-tree vacuum bug, where we neglect to pin the
index pages after the last b-tree vacuum record? Thanks.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Summary and Plan for Hot Standby

From
Simon Riggs
Date:
On Sun, 2009-11-15 at 22:45 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > Right now, I don't know which you
> > consider to be the must-fix issues, hence the thread.
> 
> Ok, could you tackle the b-tree vacuum bug, where we neglect to pin the
> index pages after the last b-tree vacuum record? Thanks.

That's all? You sure?

-- Simon Riggs           www.2ndQuadrant.com



Re: Summary and Plan for Hot Standby

From
Josh Berkus
Date:
On 11/15/09 12:58 PM, Simon Riggs wrote:
> On Sun, 2009-11-15 at 22:45 +0200, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> Right now, I don't know which you
>>> consider to be the must-fix issues, hence the thread.
>> Ok, could you tackle the b-tree vacuum bug, where we neglect to pin the
>> index pages after the last b-tree vacuum record? Thanks.
> 
> That's all? You sure?

Just speaking from a user/tester perspective, a HS with known caveats
and failure conditions would be acceptable in Alpha3.  It would be
better than waiting for Alpha4.

Not only would getting some form of HS into Alpha3 get people testing HS
and finding failure conditions we didn't think of earlier, it would also
inspire people to compile and test the Alphas, period.  Right now the
whole Alpha testing program seems to have attracted only The Usual
Contributors, despite efforts to publicize it.

So I'm in favor of committing part of the HS code even if there are
known failure conditions, as long as those conditions are well-defined.

(and applause to Simon and Heikki for continuing to put noses to
grindstones on this, and to Robert for keeping an eye on the schedule)

--Josh Berkus



Re: Summary and Plan for Hot Standby

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Sun, 2009-11-15 at 22:45 +0200, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> Right now, I don't know which you
>>> consider to be the must-fix issues, hence the thread.
>> Ok, could you tackle the b-tree vacuum bug, where we neglect to pin the
>> index pages after the last b-tree vacuum record? Thanks.
> 
> That's all? You sure?

For starters. If you think you'll get that done quickly, please take a
look at the "bucket of ice-water" issue next.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Summary and Plan for Hot Standby

From
Simon Riggs
Date:
On Sun, 2009-11-15 at 23:14 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Sun, 2009-11-15 at 22:45 +0200, Heikki Linnakangas wrote:
> >> Simon Riggs wrote:
> >>> Right now, I don't know which you
> >>> consider to be the must-fix issues, hence the thread.
> >> Ok, could you tackle the b-tree vacuum bug, where we neglect to pin the
> >> index pages after the last b-tree vacuum record? Thanks.
> > 
> > That's all? You sure?
> 
> For starters. If you think you'll get that done quickly, please take a
> look at the "bucket of ice-water" issue next.

Sure, I'll see if I can reach for the bucket.

-- Simon Riggs           www.2ndQuadrant.com



Re: Summary and Plan for Hot Standby

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> So I'm in favor of committing part of the HS code even if there are
> known failure conditions, as long as those conditions are well-defined.

If we're thinking of committing something that is known broken, I would
want to have a clearly defined and trust-inspiring escape strategy.
"We can always revert the patch later" inspires absolutely zero
confidence here, because in a patch this large there are always going to
be overlaps with other later patches.  If it gets to be February and HS
is still unshippable, reverting is going to be a tricky and risky
affair.

I agree with Heikki that it would be better not to commit as long as
any clear showstoppers remain unresolved.
        regards, tom lane


Re: Summary and Plan for Hot Standby

From
Robert Haas
Date:
On Nov 15, 2009, at 4:19 PM, Simon Riggs <simon@2ndQuadrant.com> wrote:

> On Sun, 2009-11-15 at 23:14 +0200, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> On Sun, 2009-11-15 at 22:45 +0200, Heikki Linnakangas wrote:
>>>> Simon Riggs wrote:
>>>>> Right now, I don't know which you
>>>>> consider to be the must-fix issues, hence the thread.
>>>> Ok, could you tackle the b-tree vacuum bug, where we neglect to
>>>> pin the index pages after the last b-tree vacuum record? Thanks.
>>>
>>> That's all? You sure?
>>
>> For starters. If you think you'll get that done quickly, please take
>> a look at the "bucket of ice-water" issue next.
>
> Sure, I'll see if I can reach for the bucket.

Me and my big fat mouth...

...Robert


Re: Summary and Plan for Hot Standby

From
"David E. Wheeler"
Date:
On Nov 15, 2009, at 2:17 PM, Tom Lane wrote:

>> So I'm in favor of committing part of the HS code even if there are
>> known failure conditions, as long as those conditions are well-defined.
> 
> If we're thinking of committing something that is known broken, I would
> want to have a clearly defined and trust-inspiring escape strategy.
> "We can always revert the patch later" inspires absolutely zero
> confidence here, because in a patch this large there are always going to
> be overlaps with other later patches.  If it gets to be February and HS
> is still unshippable, reverting is going to be a tricky and risky
> affair.
> 
> I agree with Heikki that it would be better not to commit as long as
> any clear showstoppers remain unresolved.

If ever there were an argument for topic branches, *this is it*.

Best,

David


Re: Summary and Plan for Hot Standby

From
Tom Lane
Date:
"David E. Wheeler" <david@kineticode.com> writes:
> On Nov 15, 2009, at 2:17 PM, Tom Lane wrote:
>> I agree with Heikki that it would be better not to commit as long as
>> any clear showstoppers remain unresolved.

> If ever there were an argument for topic branches, *this is it*.

How so?  They've got a perfectly good topic branch, ie, the external
git repository they're already working in.  If the branch were within
core CVS it would accomplish exactly nothing more as far as easing the
eventual merge.
        regards, tom lane


Re: Summary and Plan for Hot Standby

From
Tatsuo Ishii
Date:
Just a question:

- Does Hot Standby allow the use of prepared queries (not prepared
  transactions) in the standby? I mean: can a Parse message from the
  frontend be accepted by the standby?

- Can we create temporary tables in the standby?
--
Tatsuo Ishii
SRA OSS, Inc. Japan



Re: Summary and Plan for Hot Standby

From
Heikki Linnakangas
Date:
Tatsuo Ishii wrote:
> Just a question:
> 
> - Does Hot Standby allow the use of prepared queries (not prepared
>   transactions) in the standby? I mean: can a Parse message from the
>   frontend be accepted by the standby?

Yes.

> - Can we create temporary tables in the standby?

No, because creating a temporary table needs to write to the catalogs.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Summary and Plan for Hot Standby

From
Tatsuo Ishii
Date:
> Tatsuo Ishii wrote:
> > Just a question:
> > 
> > - Does Hot Standby allow the use of prepared queries (not prepared
> >   transactions) in the standby? I mean: can a Parse message from the
> >   frontend be accepted by the standby?
> 
> Yes.

In my understanding, Parse will acquire locks on the target table. Is
this harmless?
--
Tatsuo Ishii
SRA OSS, Inc. Japan


Re: Summary and Plan for Hot Standby

From
Heikki Linnakangas
Date:
Tatsuo Ishii wrote:
> In my understanding, Parse will acquire locks on the target table. Is
> this harmless?

That's ok, you can take AccessShareLocks in a standby. All queries lock
the target table (in AccessShare mode).

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Summary and Plan for Hot Standby

From
Simon Riggs
Date:
On Mon, 2009-11-16 at 13:23 +0900, Tatsuo Ishii wrote:
> Just a question:
> 
> - Does Hot Standby allow the use of prepared queries (not prepared
>   transactions) in the standby? I mean: can a Parse message from the
>   frontend be accepted by the standby?

Yes, no problem with any of those kinds of facilities.

> - Can we create temporary tables in standby?

No, but this is for two reasons

* CREATE TEMPORARY TABLE actually writes to catalog tables. It doesn't
need to do that, so allowing this would require some medium-heavy
lifting of the way temp tables work. A preliminary design was agreed in
July 2008. I believe it would be a popular feature, since about 40-50%
of people ask for this.

* CREATE TEMP TABLE is currently considered to be disallowed during read
only transactions. That might be able to change if the underlying
physical operation were write-free.

-- Simon Riggs           www.2ndQuadrant.com



Re: Summary and Plan for Hot Standby

From
Tatsuo Ishii
Date:
> > - Does Hot Standby allow to use prepared query (not prepared
> >   transaction) in standby? I mean: Parse message from frontend can be
> >   accepted by standby?
> 
> Yes, no problem with any of those kinds of facilities.

Please correct me if I'm wrong. Parse will result in obtaining
RowExclusiveLock on the target table if it is parsing
INSERT/UPDATE/DELETE. If so, is this ok in the standby?

> > - Can we create temporary tables in standby?
> 
> No, but this is for two reasons
> 
> * CREATE TEMPORARY TABLE actually writes to catalog tables. It doesn't
> need to do that, so allowing this would require some medium-heavy
> lifting of the way temp tables work. A preliminary design was agreed in
> July 2008. I believe it would be a popular feature, since about 40-50%
> of people ask for this.
> 
> * CREATE TEMP TABLE is currently considered to be disallowed during read
> only transactions. That might be able to change if the underlying
> physical operation were write-free.

Thanks for explanation.
--
Tatsuo Ishii
SRA OSS, Inc. Japan


Re: Summary and Plan for Hot Standby

From
"Kevin Grittner"
Date:
Tom Lane  wrote:
> I agree with Heikki that it would be better not to commit as long as
> any clear showstoppers remain unresolved.
I agree that it would be better not to commit as long as any of the
following are true:

(1)  There are any known issues which would break things for clusters
     *not using* hot standby.

(2)  There isn't an easy way to disable configuration of hot
     standby.

(3)  There is significant doubt that the vast majority of the patch
     will be useful in the eventually-enabled final solution.

If none of these are true, I'm not sure what the down side of a commit
is.
-Kevin


Re: Summary and Plan for Hot Standby

From
Robert Haas
Date:
On Mon, Nov 16, 2009 at 11:07 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Tom Lane  wrote:
>
>> I agree with Heikki that it would be better not to commit as long as
>> any clear showstoppers remain unresolved.
>
> I agree that it would be better not to commit as long as any of the
> following are true:
>
> (1)  There are any known issues which would break things for clusters
>     *not using* hot standby.
>
> (2)  There isn't an easy way to disable configuration of hot
>     standby.
>
> (3)  There is significant doubt that the vast majority of the patch
>     will be useful in the eventually-enabled final solution.
>
> If none of these are true, I'm not sure what the down side of a commit
> is.

Well, I think you wouldn't want to commit something that enabled Hot
Standby but caused Hot Standby queries to give wrong answers, or
didn't even allow some/all queries to be executed.  That's fairly
pointless, and might mislead users into thinking we had a feature when
we really didn't.

...Robert


Re: Summary and Plan for Hot Standby

From
Simon Riggs
Date:
On Mon, 2009-11-16 at 19:06 +0900, Tatsuo Ishii wrote:
> > > - Does Hot Standby allow to use prepared query (not prepared
> > >   transaction) in standby? I mean: Parse message from frontend can be
> > >   accepted by standby?
> > 
> > Yes, no problem with any of those kinds of facilities.
> 
> Please correct me if I'm wrong. Parse will result in obtaining
> RowExclusiveLock on the target table if it is parsing
> INSERT/UPDATE/DELETE. If so, is this ok in the standby?

Any attempt to take RowExclusiveLock will fail.

Any attempt to execute INSERT/UPDATE/DELETE will fail.

This behaviour should be identical to read only transaction mode. If it
is not documented as an exception, please report as a bug.
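For comparison, a minimal sketch of the same behaviour in an ordinary
read only transaction (table name hypothetical):

BEGIN;
SET TRANSACTION READ ONLY;
INSERT INTO t1 VALUES (1);
ERROR:  cannot execute INSERT in a read-only transaction
ROLLBACK;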

-- Simon Riggs           www.2ndQuadrant.com



Re: Summary and Plan for Hot Standby

From
Tatsuo Ishii
Date:
> > Please correct me if I'm wrong. Parse will result in obtaining
> > RowExclusiveLock on the target table if it is parsing
> > INSERT/UPDATE/DELETE. If so, is this ok in the standby?
> 
> Any attempt to take RowExclusiveLock will fail.
> 
> Any attempt to execute INSERT/UPDATE/DELETE will fail.
> 
> This behaviour should be identical to read only transaction mode. If it
> is not documented as an exception, please report as a bug.

Is it?

It seems read only transaction mode is perfectly happy with
RowExclusiveLock:

test=# begin;
BEGIN
test=# set transaction read only;
SET
test=# prepare a(int) as insert into t1 values($1);
PREPARE
test=# \x
Expanded display is on.
test=# select * from pg_locks;
-[ RECORD 1 ]------+-----------------
locktype           | relation
database           | 1297143
relation           | 10969
page               | 
tuple              | 
virtualxid         | 
transactionid      | 
classid            | 
objid              | 
objsubid           | 
virtualtransaction | 1/101699
pid                | 28020
mode               | AccessShareLock
granted            | t
-[ RECORD 2 ]------+-----------------
locktype           | virtualxid
database           | 
relation           | 
page               | 
tuple              | 
virtualxid         | 1/101699
transactionid      | 
classid            | 
objid              | 
objsubid           | 
virtualtransaction | 1/101699
pid                | 28020
mode               | ExclusiveLock
granted            | t
-[ RECORD 3 ]------+-----------------
locktype           | relation
database           | 1297143
relation           | 1574918
page               | 
tuple              | 
virtualxid         | 
transactionid      | 
classid            | 
objid              | 
objsubid           | 
virtualtransaction | 1/101699
pid                | 28020
mode               | RowExclusiveLock
granted            | t

test=# select relname from pg_class where oid = 1574918;
-[ RECORD 1 ]
relname | t1
--
Tatsuo Ishii
SRA OSS, Inc. Japan


Re: Summary and Plan for Hot Standby

From
"Kevin Grittner"
Date:
Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Nov 16, 2009 at 11:07 AM, Kevin Grittner
> <Kevin.Grittner@wicourts.gov> wrote:
>> Tom Lane  wrote:
>>
>>> I agree with Heikki that it would be better not to commit as long
>>> as any clear showstoppers remain unresolved.
>>
>> I agree that it would be better not to commit as long as any of the
>> following are true:
>>
>> (1)  There are any known issues which would break things for
>>      clusters *not using* hot standby.
>>
>> (2)  There isn't an easy way to disable configuration of hot
>>      standby.
>>
>> (3)  There is significant doubt that the vast majority of the patch
>>      will be useful in the eventually-enabled final solution.
>>
>> If none of these are true, I'm not sure what the down side of a
>> commit is.
> 
> Well, I think you wouldn't want to commit something that enabled Hot
> Standby but caused Hot Standby queries to give wrong answers, or
> didn't even allow some/all queries to be executed.  That's fairly
> pointless, and might mislead users into thinking we had a feature
> when we really didn't.
I might.  Based on my project management experience and the tone of
the posts on this feature, I suspect that there would be benefit to
committing the code -- even if the ability to enable it was commented
out.  For starters, I suspect that most people won't be using it, so
the most important thing for most users is that the patch breaks
nothing when the feature is not configured.  Also, if we have high
confidence that the vast majority of this code will eventually be
committed, and likely in 8.5, the sooner it is what people work
against, the less likely that a late change will destabilize
something.
Having made that point, I'm happy to leave it to Heikki's judgment;
it's definitely not something I care enough about to argue at any
length....
-Kevin


Re: Summary and Plan for Hot Standby

From
Heikki Linnakangas
Date:
Tatsuo Ishii wrote:
>>> Please correct me if I'm wrong. Parse will result in obtaining
>>> RowExclusiveLock on the target table if it is parsing
>>> INSERT/UPDATE/DELETE. If so, is this ok in the standby?
>> Any attempt to take RowExclusiveLock will fail.
>>
>> Any attempt to execute INSERT/UPDATE/DELETE will fail.
>>
>> This behaviour should be identical to read only transaction mode. If it
>> is not documented as an exception, please report as a bug.
> 
> Is it?
> 
> It seems read only transaction mode is perfectly happy with
> RowExclusiveLock:

Hmm, that's a good point. I can't immediately see that that would cause
any trouble, but it gives me an uncomfortable feeling about the locking.
Which locks exactly need to be replayed in standby, and why? Which locks
can read-only transactions acquire?

The doc says:
+   In recovery, transactions will not be permitted to take any table lock
+   higher than AccessShareLock. In addition, transactions may never assign
+   a TransactionId and may never write WAL.
+   Any LOCK TABLE command that runs on the standby and requests a specific
+   lock type other than AccessShareLock will be rejected.

which seems wrong, given Tatsuo-san's example. Is that paragraph only
referring to LOCK TABLE, and not other means of acquiring locks? Either
way, it needs to be clarified or fixed.

access/transam/README says:
+Further details on locking mechanics in recovery are given in comments
+with the Lock rmgr code.

but there's no explanation there either *why* the locking works as it
is. In LockAcquire(), we forbid taking locks higher than AccessShareLock
during recovery mode, but only for LOCKTAG_OBJECT locks. Why?

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Summary and Plan for Hot Standby

From
Simon Riggs
Date:
On Wed, 2009-11-18 at 14:51 +0200, Heikki Linnakangas wrote:
> Tatsuo Ishii wrote:
> >>> Please correct me if I'm wrong. Parse will result in obtaining
> >>> RowExclusiveLock on the target table if it is parsing
> >>> INSERT/UPDATE/DELETE. If so, is this ok in the standby?
> >> Any attempt to take RowExclusiveLock will fail.
> >>
> >> Any attempt to execute INSERT/UPDATE/DELETE will fail.
> >>
> >> This behaviour should be identical to read only transaction mode. If it
> >> is not documented as an exception, please report as a bug.
> > 
> > Is it?
> > 
> > It seems read only transaction mode is perfectly happy with
> > RowExclusiveLock:
> 
> Hmm, that's a good point. I can't immediately see that that would cause
> any trouble, but it gives me an uncomfortable feeling about the locking.
> Which locks exactly need to be replayed in standby, and why? Which locks
> can read-only transactions acquire?
> 
> The doc says:
> +   In recovery, transactions will not be permitted to take any table lock
> +   higher than AccessShareLock. In addition, transactions may never assign
> +   a TransactionId and may never write WAL.
> +   Any LOCK TABLE command that runs on the standby and requests a specific
> +   lock type other than AccessShareLock will be rejected.
> 
> which seems wrong, given Tatsuo-san's example. Is that paragraph only
> referring to LOCK TABLE, and not other means of acquiring locks? Either
> way, it needs to be clarified or fixed.
> 
> access/transam/README says:
> +Further details on locking mechanics in recovery are given in comments
> +with the Lock rmgr code.
> 
> but there's no explanation there either *why* the locking works as it
> is. In LockAcquire(), we forbid taking locks higher than AccessShareLock
> during recovery mode, but only for LOCKTAG_OBJECT locks. Why?

Recovery does *not* take the same locks as the original statements on
the master took. For example, the WAL record for an INSERT just makes
its changes without acquiring locks. This is OK as long as we only allow
read-only users to acquire AccessShareLocks. If we allowed higher locks
we might need to do deadlock detection, which would add more complexity.

The above restrictions are limited to LOCKTAG_OBJECT so that advisory
locks work as advertised. So advisory locks can take both shared and
exclusive locks. This never conflicts with recovery because advisory
locks are not WAL logged.
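For example, both advisory lock flavours remain usable from a read-only
session on the standby (key value arbitrary):

SELECT pg_advisory_lock(12345);           -- exclusive advisory lock
SELECT pg_advisory_unlock(12345);
SELECT pg_advisory_lock_shared(12345);    -- shared advisory lock
SELECT pg_advisory_unlock_shared(12345);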

-- Simon Riggs           www.2ndQuadrant.com



Re: Summary and Plan for Hot Standby

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> Recovery does *not* take the same locks as the original statements on
> the master took. For example, the WAL record for an INSERT just makes
> its changes without acquiring locks. This is OK as long as we only allow
> read-only users to acquire AccessShareLocks. If we allowed higher locks
> we might need to do deadlock detection, which would add more complexity.

But we *do* allow higher locks than AccessShareLocks, as Tatsuo-san's
example shows. Is that a bug?

> The above restrictions are limited to LOCKTAG_OBJECT so that advisory
> locks work as advertised. So advisory locks can take both shared and
> exclusive locks. This never conflicts with recovery because advisory
> locks are not WAL logged.

So we allow any lock on anything *except* LOCKTAG_OBJECT. That includes
advisory locks, but also relation locks, tuple locks and page locks.

Looking at the lock types in detail:

LOCKTAG_RELATION

Any lock level is allowed. We have other defenses against actually
modifying a relation, but it feels a bit fragile and I got the
impression from your comments that it's not intentional.

LOCKTAG_RELATION_EXTEND

Any lock level is allowed. Again, we have other defenses against
modifying relations, but feels fragile.

LOCKTAG_PAGE

Any lock level is allowed. Page locks are only used when extending a
hash index, so it seems irrelevant what we do. I think we should
disallow page locks in standby altogether.

LOCKTAG_TUPLE,

Any lock level is allowed. Only used when locking a tuple for update. We
forbid locking tuples by the general "is the transaction read-only?"
check in executor, and if you manage to bypass that, you will fail to
get an XID to set to xmax. Nevertheless, seems we shouldn't allow tuple
locks.

LOCKTAG_TRANSACTION,

Any lock level is allowed. Acquired in AssignTransactionId, to allow
others to wait for the transaction to finish. We don't allow
AssignTransactionId() during recovery, but could someone want to wait
for a transaction to finish? All the current callers of
XactLockTableWait() seem to be from operations that are not allowed in
recovery. Should we take a conservative stance and disallow taking
transaction-locks?

LOCKTAG_VIRTUALTRANSACTION

Any lock level is allowed. Similar to transaction locks, but virtual
transaction locks are held by read-only transactions as well. Also
during recovery, and we rely on it in the code to wait for a conflicting
transaction to finish. But we don't acquire locks to represent
transactions in master.

LOCKTAG_OBJECT,

Anything higher than AccessShareLock is disallowed. Used by dependency
walking in pg_depend.c. Also used as interlock between database start
and DROP/CREATE DATABASE. At backend start, we normally take
RowExclusiveLock on the database in postinit.c, but you had to modify it to
acquire AccessShareLock instead in standby mode.

LOCKTAG_USERLOCK
LOCKTAG_ADVISORY

Any lock level is allowed. As documented, advisory locks are per-server,
so a lock taken in master doesn't conflict with one taken in slave.


In any case, all this really needs to be documented in a README or
something.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Summary and Plan for Hot Standby

From
Tatsuo Ishii
Date:
> Simon Riggs wrote:
> > Recovery does *not* take the same locks as the original statements on
> > the master took. For example, the WAL record for an INSERT just makes
> > its changes without acquiring locks. This is OK as long as we only allow
> > read-only users to acquire AccessShareLocks. If we allowed higher locks
> > we might need to do deadlock detection, which would add more complexity.
> 
> But we *do* allow higher locks than AccessShareLocks, as Tatsuo-san's
> example shows. Is that a bug?

Sorry for confusion. My example is under normal PostgreSQL, not under
HS enabled.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

> > The above restrictions are limited to LOCKTAG_OBJECT so that advisory
> > locks work as advertised. So advisory locks can take both shared and
> > exclusive locks. This never conflicts with recovery because advisory
> > locks are not WAL logged.
> 
> So we allow any lock on anything *except* LOCKTAG_OBJECT. That includes
> advisory locks, but also relation locks, tuple locks and page locks.
> 
> Looking at the lock types in detail:
> 
> LOCKTAG_RELATION
> 
> Any lock level is allowed. We have other defenses against actually
> modifying a relation, but it feels a bit fragile and I got the
> impression from your comments that it's not intentional.
> 
> LOCKTAG_RELATION_EXTEND
> 
> Any lock level is allowed. Again, we have other defenses against
> modifying relations, but feels fragile.
> 
> LOCKTAG_PAGE
> 
> Any lock level is allowed. Page locks are only used when extending a
> hash index, so it seems irrelevant what we do. I think we should
> disallow page locks in standby altogether.
> 
> LOCKTAG_TUPLE,
> 
> Any lock level is allowed. Only used when locking a tuple for update. We
> forbid locking tuples by the general "is the transaction read-only?"
> check in executor, and if you manage to bypass that, you will fail to
> get an XID to set to xmax. Nevertheless, seems we shouldn't allow tuple
> locks.
> 
> LOCKTAG_TRANSACTION,
> 
> Any lock level is allowed. Acquired in AssignTransactionId, to allow
> others to wait for the transaction to finish. We don't allow
> AssignTransactionId() during recovery, but could someone want to wait
> for a transaction to finish? All the current callers of
> XactLockTableWait() seem to be from operations that are not allowed in
> recovery. Should we take a conservative stance and disallow taking
> transaction-locks?
> 
> LOCKTAG_VIRTUALTRANSACTION
> 
> Any lock level is allowed. Similar to transaction locks, but virtual
> transaction locks are held by read-only transactions as well. Also
> during recovery, and we rely on it in the code to wait for a conflicting
> transaction to finish. But we don't acquire locks to represent
> transactions in master.
> 
> LOCKTAG_OBJECT,
> 
> Anything higher than AccessShareLock is disallowed. Used by dependency
> walking in pg_depend.c. Also used as interlock between database start
> and DROP/CREATE DATABASE. At backend start, we normally take
> RowExclusiveLock on the database in postinit.c, but you had to modify it to
> acquire AccessShareLock instead in standby mode.
> 
> LOCKTAG_USERLOCK
> LOCKTAG_ADVISORY
> 
> Any lock level is allowed. As documented, advisory locks are per-server,
> so a lock taken in master doesn't conflict with one taken in slave.
> 
> 
> In any case, all this really needs to be documented in a README or
> something.
> 
> -- 
>   Heikki Linnakangas
>   EnterpriseDB   http://www.enterprisedb.com


Re: Summary and Plan for Hot Standby

From
Heikki Linnakangas
Date:
Tatsuo Ishii wrote:
> Sorry for confusion. My example is under normal PostgreSQL, not under
> HS enabled.

You get the same result in standby:

postgres=# begin;
BEGIN
postgres=# prepare a(int) as insert into foo values($1);
PREPARE
postgres=# SELECT * FROM pg_locks;
  locktype  │ database │ relation │ page │ tuple │ virtualxid │ transactionid │ classid │ objid │ objsubid │ virtualtransaction │  pid  │       mode       │ granted
────────────┼──────────┼──────────┼──────┼───────┼────────────┼───────────────┼─────────┼───────┼──────────┼────────────────────┼───────┼──────────────────┼─────────
 relation   │    11564 │    10968 │      │       │            │               │         │       │          │ 2/4                │ 10449 │ AccessShareLock  │ t
 relation   │    11564 │    16384 │      │       │            │               │         │       │          │ 2/4                │ 10449 │ RowExclusiveLock │ t
 virtualxid │          │          │      │       │ 1/1        │               │         │       │          │ 1/0                │ 10419 │ ExclusiveLock    │ t
 virtualxid │          │          │      │       │ 2/4        │               │         │       │          │ 2/4                │ 10449 │ ExclusiveLock    │ t
(4 rows)

This is from a standby.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Summary and Plan for Hot Standby

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tatsuo Ishii wrote:
>> Sorry for confusion. My example is under normal PostgreSQL, not under
>> HS enabled.

> You get the same result in standby:

AFAICT Tatsuo's example just shows that we might wish to add a check
for read-only transaction mode before parsing an INSERT/UPDATE/DELETE
command.  But it seems relatively minor in any case --- at the worst
you'd get an unexpected error message, no?
        regards, tom lane


Re: Summary and Plan for Hot Standby

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Tatsuo Ishii wrote:
>>> Sorry for confusion. My example is under normal PostgreSQL, not under
>>> HS enabled.
> 
>> You get the same result in standby:
> 
> AFAICT Tatsuo's example just shows that we might wish to add a check
> for read-only transaction mode before parsing an INSERT/UPDATE/DELETE
> command.  But it seems relatively minor in any case --- at the worst
> you'd get an unexpected error message, no?

Right, it's harmless AFAICS. And it might actually be useful to be able
to prepare all queries right after connecting, even though the
connection is not yet read-write.

It's the documentation (in source code or README) that's lacking, and
perhaps we should add more explicit checks for the "can't happen" cases,
just in case.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Summary and Plan for Hot Standby

From
Josh Berkus
Date:
On 11/15/09 11:07 PM, Heikki Linnakangas wrote:
> - When replaying b-tree deletions, we currently wait out/cancel all
> running (read-only) transactions. We take the ultra-conservative stance
> because we don't know how recent the tuples being deleted are. If we
> could store a better estimate for latestRemovedXid in the WAL record, we
> could make that less conservative.

Simon was explaining this issue here at JPUGCon; now that I understand
it, this specific issue seems like the worst usability issue in HS now.
Bad enough to kill its usefulness for users, or even our ability to get
 
useful testing data; in an OLTP production database with several hundred
inserts per second it would result in pretty much never being able to
get any query which takes longer than a few seconds to complete on the
slave.

--Josh Berkus




Re: Summary and Plan for Hot Standby

From
"Joshua D. Drake"
Date:
On Fri, 2009-11-20 at 11:14 +0900, Josh Berkus wrote:
> On 11/15/09 11:07 PM, Heikki Linnakangas wrote:
> > - When replaying b-tree deletions, we currently wait out/cancel all
> > running (read-only) transactions. We take the ultra-conservative stance
> > because we don't know how recent the tuples being deleted are. If we
> > could store a better estimate for latestRemovedXid in the WAL record, we
> > could make that less conservative.
>
> Simon was explaining this issue here at JPUGCon; now that I understand
> it, this specific issue seems like the worst usability issue in HS now.
>  Bad enough to kill its usefulness for users, or even our ability to get
> useful testing data; in an OLTP production database with several hundred
> inserts per second it would result in pretty much never being able to
> get any query which takes longer than a few seconds to complete on the
> slave.

I am pretty sure that OmniTI, PgExperts, EDB and CMD all have customers
that are doing more than that... This sounds pretty significant.

Joshua D. Drake


>
> --Josh Berkus
>
>
>


--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering
If the world pushes look it in the eye and GRR. Then push back harder. - Salamander

Re: Summary and Plan for Hot Standby

From
Andrew Dunstan
Date:

Joshua D. Drake wrote:
> On Fri, 2009-11-20 at 11:14 +0900, Josh Berkus wrote:
>   
>> On 11/15/09 11:07 PM, Heikki Linnakangas wrote:
>>     
>>> - When replaying b-tree deletions, we currently wait out/cancel all
>>> running (read-only) transactions. We take the ultra-conservative stance
>>> because we don't know how recent the tuples being deleted are. If we
>>> could store a better estimate for latestRemovedXid in the WAL record, we
>>> could make that less conservative.
>>>       
>> Simon was explaining this issue here at JPUGCon; now that I understand
>> it, this specific issue seems like the worst usability issue in HS now.
>>  Bad enough to kill its usefulness for users, or even our ability to get
>> useful testing data; in an OLTP production database with several hundred
>> inserts per second it would result in pretty much never being able to
>> get any query which takes longer than a few seconds to complete on the
>> slave.
>>     
>
> I am pretty sure that OmniTI, PgExperts, EDB and CMD all have customers
> that are doing more than that... This sounds pretty significant.
>
>   

Right. The major use I was hoping for from HS was exactly to be able to 
run long-running queries. In one case I am thinking of, we have moved 
the business intelligence uses off the OLTP server onto a londiste 
replica, and I was really wanting to move that to a Hot Standby server.

cheers

andrew


Re: Summary and Plan for Hot Standby

From
Simon Riggs
Date:
On Thu, 2009-11-19 at 10:13 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > Recovery does *not* take the same locks as the original statements on
> > the master took. For example, the WAL record for an INSERT just makes
> > its changes without acquiring locks. This is OK as long as we only allow
> > read-only users to acquire AccessShareLocks. If we allowed higher locks
> > we might need to do deadlock detection, which would add more complexity.
> 
> > But we *do* allow higher locks than AccessShareLocks, as Tatsuo-san's
> example shows. Is that a bug?
> 
> > The above restrictions are limited to LOCKTAG_OBJECT so that advisory
> > locks work as advertised. So advisory locks can take both shared and
> > exclusive locks. This never conflicts with recovery because advisory
> > locks are not WAL logged.
> 
> So we allow any lock on anything *except* LOCKTAG_OBJECT. That includes
> advisory locks, but also relation locks, tuple locks and page locks.
> 
> Looking at the lock types in detail:
> 
> LOCKTAG_RELATION
> 
> Any lock level is allowed. We have other defenses against actually
> modifying a relation, but it feels a bit fragile and I got the
> impression from your comments that it's not intentional.

Possibly fragile, will look further. LOCKTAG_OBJECT was the important
one in testing.

> LOCKTAG_RELATION_EXTEND
> 
> Any lock level is allowed. Again, we have other defenses against
> modifying relations, but feels fragile.

This only ever happens after an xid is assigned, which can never happen
during recovery.
Happy to add protection if you think so.

> LOCKTAG_PAGE
> 
> Any lock level is allowed. Page locks are only used when extending a
> hash index, so it seems irrelevant what we do. I think we should
> disallow page locks in standby altogether.

As above, but OK.

> LOCKTAG_TUPLE,
> 
> Any lock level is allowed. Only used when locking a tuple for update. We
> forbid locking tuples by the general "is the transaction read-only?"
> check in executor, and if you manage to bypass that, you will fail to
> get an XID to set to xmax. Nevertheless, seems we shouldn't allow tuple
> locks.

Specifically disallowed earlier, when queries request row marks.

> LOCKTAG_TRANSACTION,
> 
> Any lock level is allowed. Acquired in AssignTransactionId, to allow
> others to wait for the transaction to finish. We don't allow
> AssignTransactionId() during recovery, but could someone want to wait
> for a transaction to finish? All the current callers of
> XactLockTableWait() seem to be from operations that are not allowed in
> recovery. Should we take a conservative stance and disallow taking
> transaction-locks?

Only used after xid assignment, which is disallowed.

> LOCKTAG_VIRTUALTRANSACTION
> 
> Any lock level is allowed. Similar to transaction locks, but virtual
> transaction locks are held by read-only transactions as well. Also
> during recovery, and we rely on it in the code to wait for a conflicting
> transaction to finish. But we don't acquire locks to represent
> transactions in master.

Only ever requested as exclusive.

> LOCKTAG_OBJECT,
> 
> Anything higher than AccessShareLock is disallowed. Used by dependency
> walking in pg_depend.c. Also used as interlock between database start
> and DROP/CREATE DATABASE. At backend start, we normally take
> RowExclusiveLock on the database in postinit.c, but you had to modify it to
> acquire AccessShareLock instead in standby mode.

Yes

> LOCKTAG_USERLOCK
> LOCKTAG_ADVISORY
> 
> Any lock level is allowed. As documented, advisory locks are per-server,
> so a lock taken in master doesn't conflict with one taken in slave.

Yes

-- Simon Riggs           www.2ndQuadrant.com



Re: Summary and Plan for Hot Standby

From
Simon Riggs
Date:
On Thu, 2009-11-19 at 17:15 +0900, Tatsuo Ishii wrote:
> > Simon Riggs wrote:
> > > Recovery does *not* take the same locks as the original statements on
> > > the master took. For example, the WAL record for an INSERT just makes
> > > its changes without acquiring locks. This is OK as long as we only allow
> > > read-only users to acquire AccessShareLocks. If we allowed higher locks
> > > we might need to do deadlock detection, which would add more complexity.
> > 
> > But we *do* allow higher locks than AccessShareLocks, as Tatsuo-san's
> > example shows. Is that a bug?
> 
> Sorry for confusion. My example is under normal PostgreSQL, not under
> HS enabled.

Are you saying you want it to work in HS mode?

Why would you want to PREPARE an INSERT, but never execute it?

-- Simon Riggs           www.2ndQuadrant.com



Re: Summary and Plan for Hot Standby

From
Stefan Kaltenbrunner
Date:
Simon Riggs wrote:
> On Thu, 2009-11-19 at 17:15 +0900, Tatsuo Ishii wrote:
>>> Simon Riggs wrote:
>>>> Recovery does *not* take the same locks as the original statements on
>>>> the master took. For example, the WAL record for an INSERT just makes
>>>> its changes without acquiring locks. This is OK as long as we only allow
>>>> read-only users to acquire AccessShareLocks. If we allowed higher locks
>>>> we might need to do deadlock detection, which would add more complexity.
>>> But we *do* allow higher locks than AccessShareLocks, as Tatsuo-san's
>>> example shows. Is that a bug?
>> Sorry for confusion. My example is under normal PostgreSQL, not under
>> HS enabled.
> 
> Are you saying you want it to work in HS mode?
> 
> Why would you want to PREPARE an INSERT, but never execute it?

Well, I can easily imagine an application that keeps persistent
connections and prepares all the queries it might execute right after
the initial connection, while still being aware of the master/slave
setup. So the scenario would be "prepare but never execute as long as
we are in recovery - but once we leave recovery we can use them".
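A sketch of that pattern (statement and table names hypothetical):

-- at connect time, possibly while the server is still in recovery:
PREPARE ins_event(int, text) AS INSERT INTO events VALUES ($1, $2);
-- executing it fails while recovery is in progress, but once the
-- standby has been promoted the prepared statement can be used as-is:
EXECUTE ins_event(1, 'first write after failover');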
 

Stefan


Re: Summary and Plan for Hot Standby

From
Heikki Linnakangas
Date:
Joshua D. Drake wrote:
> On Fri, 2009-11-20 at 11:14 +0900, Josh Berkus wrote:
>> On 11/15/09 11:07 PM, Heikki Linnakangas wrote:
>>> - When replaying b-tree deletions, we currently wait out/cancel all
>>> running (read-only) transactions. We take the ultra-conservative stance
>>> because we don't know how recent the tuples being deleted are. If we
>>> could store a better estimate for latestRemovedXid in the WAL record, we
>>> could make that less conservative.
>> Simon was explaining this issue here at JPUGCon; now that I understand
>> it, this specific issue seems like the worst usability issue in HS now.
>>  Bad enough to kill its usefulness for users, or even our ability to get
>> useful testing data; in an OLTP production database with several hundred
>> inserts per second it would result in pretty much never being able to
>> get any query which takes longer than a few seconds to complete on the
>> slave.
> 
> I am pretty sure that OmniTI, PgExperts, EDB and CMD all have customers
> that are doing more than that... This sounds pretty significant.

Agreed, it's the biggest usability issue at the moment. The
max_standby_delay option makes it less annoying, but it's still there.
I'm fine with it from a code point of view, so I'm not going to hold off
committing because of it, but it sure would be nice to address it.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Summary and Plan for Hot Standby

From
Selena Deckelmann
Date:
On Fri, Nov 20, 2009 at 11:14 AM, Josh Berkus <josh@agliodbs.com> wrote:
> On 11/15/09 11:07 PM, Heikki Linnakangas wrote:
>> - When replaying b-tree deletions, we currently wait out/cancel all
>> running (read-only) transactions. We take the ultra-conservative stance
>> because we don't know how recent the tuples being deleted are. If we
>> could store a better estimate for latestRemovedXid in the WAL record, we
>> could make that less conservative.
>
> Simon was explaining this issue here at JPUGCon; now that I understand
> it, this specific issue seems like the worst usability issue in HS now.
>  Bad enough to kill its usefulness for users, or even our ability to get
> useful testing data; in an OLTP production database with several hundred
> inserts per second it would result in pretty much never being able to
> get any query which takes longer than a few seconds to complete on the
> slave.

I don't think that's all that was discussed :)

Are you saying that it should not be committed if this issue still exists?

The point of getting Hot Standby into core is to provide useful
functionality. We can make it clear to people what the limitations
are, and Simon has said that he will continue to work on solving this
problem.

-selena


--
http://chesnok.com/daily - me
http://endpoint.com - work


Re: Summary and Plan for Hot Standby

From
Greg Stark
Date:
On Fri, Nov 20, 2009 at 2:58 AM, Andrew Dunstan <andrew@dunslane.net> wrote:
> Right. The major use I was hoping for from HS was exactly to be able to run
> long-running queries. In once case I am thinking of we have moved the
> business intelligence uses off the OLTP server onto a londiste replica, and
> I was really wanting to move that to a Hot Standby server.

I think Simon's focus on the High Availability use case has obscured
the fact that there are two entirely complementary (and conflicting)
use cases here. If your primary reason for implementing Hot Standby is
to be able to run long-running batch queries then you will probably want
to set a very high max_standby_delay or even disable it entirely. If
you set max_standby_delay to 0 then the recovery will wait
indefinitely for your batch queries to finish. You would probably need
to schedule quiet periods in order to ensure that the recovery can
catch up periodically. If you also need high availability you would
need your HA replicas to run with a low max_standby_delay setting as
well.

This doesn't mean that the index btree split problem isn't a problem
though. It's just trading one problem for another. Instead of having
all your queries summarily killed regularly you would find recovery
pausing extremely frequently for a very long time, rather than just
when vacuum runs and for a limited time.

I missed the original discussion of this problem, do you happen to
remember the subject or url for the details?

-- 
greg


Re: Summary and Plan for Hot Standby

From
Simon Riggs
Date:
On Fri, 2009-11-20 at 11:14 +0900, Josh Berkus wrote:
> On 11/15/09 11:07 PM, Heikki Linnakangas wrote:
> > - When replaying b-tree deletions, we currently wait out/cancel all
> > running (read-only) transactions. We take the ultra-conservative stance
> > because we don't know how recent the tuples being deleted are. If we
> > could store a better estimate for latestRemovedXid in the WAL record, we
> > could make that less conservative.
> 
> Simon was explaining this issue here at JPUGCon; now that I understand
> it, this specific issue seems like the worst usability issue in HS now.
>  Bad enough to kill its usefulness for users, or even our ability to get
> useful testing data; in an OLTP production database with several hundred
> inserts per second it would result in pretty much never being able to
> get any query which takes longer than a few seconds to complete on the
> slave.

<sigh> This post isn't really very helpful. You aren't providing the
second part of the discussion, nor even requesting that this issue be
fixed. I can see such comments being taken up by people with a clear
interest in dissing HS.

The case of several hundred inserts per second would not generate any
cleanup records at all. So it's not completely accurate, nor is it
acceptable to generalise. There is nothing about the HS architecture
that will prevent it from being used by high traffic sites, or for long
standby queries. The specific action that will cause problems is a
workload that generates high-volume inserts and deletes. A solution is
possible.
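To make the distinction concrete, a hypothetical workload of the
problematic shape (table, index and values invented for illustration):

-- high-volume insert/delete churn, with a b-tree index on id:
INSERT INTO work_queue (id) SELECT generate_series(1, 10000);
DELETE FROM work_queue WHERE id % 2 = 0;
VACUUM work_queue;  -- the index cleanup WAL this generates is what
                    -- can conflict with long-running standby queries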

Heikki and I had mentioned that solving this need not be part of the
initial patch, since it wouldn't affect all users. I specifically
removed my solution in July/Aug, to allow the patch to be slimmed down.

In any case, the problem does have a simple workaround that is
documented as part of the current patch. Conflict resolution is
explained in detail with the patch.

From my side, the purpose of discussing this was to highlight something
which is not technically a bug, yet clearly still needs work before
close. And it also needs to be on the table, to allow further discussion
and generate the impetus to allow work on it in this release.

-- Simon Riggs           www.2ndQuadrant.com



Re: Summary and Plan for Hot Standby

From
Simon Riggs
Date:
On Fri, 2009-11-20 at 06:47 +0000, Greg Stark wrote:
> On Fri, Nov 20, 2009 at 2:58 AM, Andrew Dunstan <andrew@dunslane.net> wrote:
> > Right. The major use I was hoping for from HS was exactly to be able to run
> > long-running queries. In one case I am thinking of, we have moved the
> > business intelligence uses off the OLTP server onto a londiste replica, and
> > I was really wanting to move that to a Hot Standby server.
> 
> I think Simon's focus on the High Availability use case has obscured
> the fact that there are two entirely complementary (and conflicting)
> use cases here. If your primary reason for implementing Hot Standby is
> to be able to run long-running batch queries then will probably want
> to set a very high max_standby_delay or even disable it entirely. If
> you set max_standby_delay to 0 then the recovery will wait
> indefinitely for your batch queries to finish. You would probably need
> to schedule quiet periods in order to ensure that the recovery can
> catch up periodically. If you also need high availability you would
> need your HA replicas to run with a low max_standby_delay setting as
> well.

If I read this correctly then I have provided the facilities you would
like. Can you confirm you have everything you want, or can you suggest
what extra feature is required?

> This doesn't mean that the index btree split problem isn't a problem
> though. It's just trading one problem for another. Instead of having
> all your queries summarily killed regularly you would find recovery
> pausing extremely frequently for a very long time, rather than just
> when vacuum runs and for a limited time.
> 
> I missed the original discussion of this problem, do you happen to
> remember the subject or url for the details?

December 2008; hackers; you, me and Heikki.

-- Simon Riggs           www.2ndQuadrant.com



Re: Summary and Plan for Hot Standby

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Fri, 2009-11-20 at 06:47 +0000, Greg Stark wrote:
>> I missed the original discussion of this problem, do you happen to
>> remember the subject or url for the details?
> 
> December 2008; hackers; you, me and Heikki.

Yep:
http://archives.postgresql.org/message-id/494B5FFE.4090909@enterprisedb.com

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Summary and Plan for Hot Standby

From
Josh Berkus
Date:
Simon,

> <sigh> This post isn't really very helpful. You aren't providing the
> second part of the discussion, nor even requesting that this issue be
> fixed. I can see such comments being taken up by people with a clear
> interest in dissing HS.

OK, I'm requesting that the issue be fixed.  I'm not sure if it needs to
be fixed before Alpha3.  I got the impression from you that others did
not think this issue really needed fixing.

> Heikki and I had mentioned that solving this need not be part of the
> initial patch, since it wouldn't effect all users. I specifically
> removed my solution in July/Aug, to allow the patch to be slimmed down.

I guess I don't understand which users wouldn't be affected.  It seems
like the only users who would avoid this are ones who don't do deletes or
index-affecting updates.

> From my side, the purpose of discussing this was to highlight something
> which is not technically a bug, yet clearly still needs work before
> close. And it also needs to be on the table, to allow further discussion
> and generate the impetus to allow work on it in this release.

Yes.  I'm realizing that because of the highly technical nature of the
discussion and the language used few people other than you and Heikki
are aware of the major issues which still need work.  It would be
helpful if someone could post a summary of outstanding issues which
didn't require prior extensive experience with the HS code to
understand; possibly you could then get people trying to tackle just
those individual issues.

--Josh Berkus


Re: Summary and Plan for Hot Standby

From
Greg Stark
Date:
On Fri, Nov 20, 2009 at 7:31 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Simon Riggs wrote:
>> On Fri, 2009-11-20 at 06:47 +0000, Greg Stark wrote:
>>> I missed the original discussion of this problem, do you happen to
>>> remember the subject or url for the details?
>>
>> December 2008; hackers; you, me and Heikki.
>
> Yep:
> http://archives.postgresql.org/message-id/494B5FFE.4090909@enterprisedb.com

And I can see I failed to understand the issue at the time.

From the list it looks like the last word was Simon's:
http://archives.postgresql.org/message-id/1229710177.4793.567.camel@ebony.2ndQuadrant

From discussions in the bar it sounds like this was actually a false
start, however, as the RecentGlobalXmin in the backend doing the split
could be less aggressive than the RecentGlobalXmin used by some other
backend to hit the hint bits, leading to inconsistent results :(

I'm leaning towards having the backend actually go fetch all the
xmin/xmaxes of the pointers being pruned. It ought to be possible to
skip that check in any database with no live snapshots so recovery
performance would be unaffected on replicas not actively being used in
hot mode.

-- 
greg


Re: Summary and Plan for Hot Standby

From
"Joshua D. Drake"
Date:
On Sun, 2009-11-15 at 17:17 -0500, Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
> > So I'm in favor of committing part of the HS code even if there are
> > known failure conditions, as long as those conditions are well-defined.
>
> If we're thinking of committing something that is known broken, I would
> want to have a clearly defined and trust-inspiring escape strategy.

If it is broken, we shouldn't commit it at all. Commit it to some
"other" git branch and call it, postgresql-alpha3-riggs-heikki if you
must but keep it out of core.

>
> I agree with Heikki that it would be better not to commit as long as
> any clear showstoppers remain unresolved.
>

Agreed.

Joshua D. Drake


>             regards, tom lane
>


--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering
If the world pushes look it in the eye and GRR. Then push back harder. - Salamander

Re: Summary and Plan for Hot Standby

From
"Joshua D. Drake"
Date:
On Fri, 2009-11-20 at 16:40 +0900, Josh Berkus wrote:

> Yes.  I'm realizing that because of the highly technical nature of the
> discussion and the language used few people other than you and Heikki
> are aware of the major issues which still need work.  It would be
> helpful if someone could post a summary of outstanding issues which
> didn't require prior extensive experience with the HS code to
> understand; possibly you could then get people trying to tackle just
> those individual issues.
>

Yes, I believe it is time for that. Those of us neck-deep in production
loads would feel a lot better if we knew from a real-world perspective
what the issue is.

Sincerely,

Joshua D. Drake


> --Josh Berkus
>


--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering
If the world pushes look it in the eye and GRR. Then push back harder. - Salamander

Re: Summary and Plan for Hot Standby

From
Heikki Linnakangas
Date:
Greg Stark wrote:
> From discussions in the bar it sounds like this was actually a false
> start, however, as the RecentGlobalXmin in the backend doing the split
> could be less aggressive than the RecentGlobalXmin used by some other
> backend to hit the hint bits, leading to inconsistent results :(

Yeah, RecentGlobalXmin was wrong, it's not used at the moment.

> I'm leaning towards having the backend actually go fetch all the
> xmin/xmaxes of the pointers being pruned. It ought to be possible to
> skip that check in any database with no live snapshots so recovery
> performance would be unaffected on replicas not actively being used in
> hot mode.

Hmm, I have always thought that it would be detrimental to
performance to go fetch the xmin/xmaxes, but maybe it indeed wouldn't be
so bad if you could optimize the common case where there are no
snapshots in the standby. Also, if you have a very busy table where a lot of
tuples are killed, many of the heap pages will probably be in cache.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Summary and Plan for Hot Standby

From
Simon Riggs
Date:
On Thu, 2009-11-19 at 10:13 +0200, Heikki Linnakangas wrote:

> At backend start, we normally take
> RowExclusiveLock on the database in postinit.c, but you had to modify
> to acquire AccessShareLock instead in standby mode.

The consensus from earlier discussion was that allowing users to grab
RowExclusiveLock during read only transactions was not a problem, since
it allowed PREPARE. So there seems no need to prevent it in other
places.

So I suggest removing most of the changes in postinit.c, and changing
the lock restrictions in lock.c to be

+    if (RecoveryInProgress() &&
+        (locktag->locktag_type == LOCKTAG_OBJECT ||
+         locktag->locktag_type == LOCKTAG_RELATION) &&
+        lockmode > RowExclusiveLock)
+        ereport(ERROR, ...);    /* reject lock requests above RowExclusiveLock */

lockcmds.c would also be changed to allow LOCK TABLE of up to
RowExclusiveLock.
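Under that proposal, a sketch of what a standby session would accept
and reject (table name hypothetical):

BEGIN;
LOCK TABLE t1 IN ROW EXCLUSIVE MODE;   -- would now be accepted
LOCK TABLE t1 IN SHARE MODE;           -- still rejected: stronger than
                                       -- RowExclusiveLock
ROLLBACK;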

-- Simon Riggs           www.2ndQuadrant.com



Re: Summary and Plan for Hot Standby

From
Simon Riggs
Date:
On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote:

> - The assumption that b-tree vacuum records don't need conflict
> resolution because we did that with the additional cleanup-info record
> works ATM, but it hinges on the fact that we don't delete any tuples
> marked as killed while we do the vacuum. That seems like a low-hanging
> fruit that I'd actually like to do now that I spotted it, but will then
> need to fix b-tree vacuum records accordingly. 

You didn't make a change, so I wonder whether you realised no change was
required or that you still think change is necessary, but have left it
to me? Not sure.

I've investigated this but can't see any problem or need for change.

btvacuumpage() is very specific about deleting *only* index tuples that
have been collected during the VACUUM's heap scan, as identified by the
heap callback function, lazy_tid_reaped().

There is no reliance at all on the state of the index tuple. If you
ain't on the list, you ain't cleaned. I accept your observation that
some additional index tuples may be marked as killed by backends
accessing the table that is being vacuumed, since those backends could
have a RecentGlobalXmin later than the OldestXmin used by the VACUUM as
a result of the change that means GetSnapshotData() ignores lazy
vacuums. But those tuples will not be identified by the callback
function and so the "additionally killed" index tuples will not be
removed.

It is a possible future optimisation of b-tree vacuum that it cleans
these additional killed tuples while it executes, but it doesn't do it
now and so we need not worry about that for HS.

I think it's important that we note this assumption though.

Comment?

-- Simon Riggs           www.2ndQuadrant.com



Re: Summary and Plan for Hot Standby

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote:
> 
>> - The assumption that b-tree vacuum records don't need conflict
>> resolution because we did that with the additional cleanup-info record
>> works ATM, but it hinges on the fact that we don't delete any tuples
>> marked as killed while we do the vacuum. That seems like a low-hanging
>> fruit that I'd actually like to do now that I spotted it, but will then
>> need to fix b-tree vacuum records accordingly. 
> 
> You didn't make a change, so I wonder whether you realised no change was
> required or that you still think change is necessary, but have left it
> to me? Not sure.
> 
> I've investigated this but can't see any problem or need for change.

Sorry if I was unclear: it works as it is. But *if* we change the b-tree
vacuum to also delete index tuples marked with LP_DEAD, we have a problem.

> I think it's important that we note this assumption though.

Yeah, a comment is in order.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Summary and Plan for Hot Standby

From
Simon Riggs
Date:
On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote:

> - When switching from standby mode to normal operation, we momentarily
> hold all AccessExclusiveLocks held by prepared xacts twice, needing
> twice the lock space. You can run out of lock space at that point,
> causing failover to fail.

Proposed patch to fix that attached.

--
 Simon Riggs           www.2ndQuadrant.com

Attachment

Re: Summary and Plan for Hot Standby

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote:
> 
>> - When switching from standby mode to normal operation, we momentarily
>> hold all AccessExclusiveLocks held by prepared xacts twice, needing
>> twice the lock space. You can run out of lock space at that point,
>> causing failover to fail.
> 
> Proposed patch to fix that attached.

Doesn't eliminate the problem completely, but certainly makes it much
less likely.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Summary and Plan for Hot Standby

From
Simon Riggs
Date:
On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote:
> - When replaying b-tree deletions, we currently wait out/cancel all
> running (read-only) transactions. We take the ultra-conservative
> stance because we don't know how recent the tuples being deleted are.
> If we could store a better estimate for latestRemovedXid in the WAL
> record, we could make that less conservative.

I think I can improve on the way we do this somewhat.

When we GetConflictingVirtualXIDs() with InvalidTransactionId we include
all backends.

If a query can only see currently-running xacts then it cannot see any
data that is being cleaned up because its xmin > latestCompletedXid.

Put another way, Assert(latestRemovedXid <= latestCompletedXid)

-- Simon Riggs           www.2ndQuadrant.com