Thread: Block level concurrency during recovery

Block level concurrency during recovery

From: Simon Riggs

I'm looking at how to make queries safe during recovery, in the presence
of concurrent changes to blocks. In particular, concurrent removal of
rows that might be read by queries.

My thinking is 
* we ignore LockRelationForExtension(). Normal queries never request it.
All new blocks were created with that lock held and we are replaying
changes serially, so we do not need to re-create that lock. We already
do this, so no change.
* re-create the Cleanup lock on blocks, when the original operation was
performed while a Cleanup lock was held.

So the plan is to introduce a new XLogLockBufferForCleanup() function
and then use it in all places where a cleanup lock was held during the
write of the WAL record.
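
A minimal sketch of what that function might look like (illustration
only, not the actual patch; it assumes the redo routine resolves the
Relation and block number in the usual way):

    #include "postgres.h"
    #include "storage/bufmgr.h"

    /*
     * Sketch: read a block during WAL replay and take the cleanup lock,
     * i.e. an exclusive lock held while we are also the sole pin-holder,
     * mirroring the lock strength held when the record was written.
     */
    static Buffer
    XLogLockBufferForCleanup(Relation reln, BlockNumber blkno)
    {
        Buffer  buf = ReadBuffer(reln, blkno);

        /* blocks until no other backend holds a pin on this buffer */
        LockBufferForCleanup(buf);
        return buf;
    }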

This means we will need to hold cleanup lock:

* while RM_HEAP2_ID records are applied (Freeze, Clean, CleanMove)

* while applying an XLOG_BTREE_DELETE that was generated by VACUUM - this
record type is not always generated by VACUUM, so split this WAL record into
two types, XLOG_BTREE_DELETE and XLOG_BTREE_VACUUM, so that we can hold the
Cleanup lock while applying XLOG_BTREE_VACUUM; see the sketch after this
list. (This may not be required, but I'd rather do the full locking now and
relax it later).

* Whenever we apply a backup block that performs the same function as
any of the above WAL records. So we would hold Cleanup lock when
applying WAL records of types:
  all RM_HEAP2_ID types
  XLOG_BTREE_VACUUM
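
For concreteness, a sketch of how the btree redo routine might dispatch
the two record types after the split (other existing btree record types
elided; the XLOG_BTREE_VACUUM case and its helper are proposed, not
committed, code):

    void
    btree_redo(XLogRecPtr lsn, XLogRecord *record)
    {
        uint8   info = record->xl_info & ~XLR_INFO_MASK;

        switch (info)
        {
            case XLOG_BTREE_DELETE:
                /* deletion by a normal backend: exclusive lock suffices */
                btree_xlog_delete(lsn, record);
                break;

            case XLOG_BTREE_VACUUM:
                /* deletion by VACUUM: replay while holding a cleanup lock */
                btree_xlog_vacuum(lsn, record);
                break;

            /* ... all other existing btree record types unchanged ... */

            default:
                elog(PANIC, "btree_redo: unknown op code %u", info);
        }
    }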

I'm most of the way through implementing the above and will post patch
as a separate item to make it easier to inspect.

Other aspects:

* For GIN indexes, we appear to not hold a Cleanup lock during
vacuuming, except on root page. That stops new scans from starting, but
it doesn't prevent progress of concurrent scans. Doesn't look correct to
me... so not sure what strength lock to acquire in each case. Probably
need to differentiate between WAL record types so we can tell which to
acquire CleanupLock for and which not.

* GIST doesn't use CleanupLocks at all. So I'm very unclear here.

Teodor has mentioned that it should be OK for GIST/GIN. Can I double
check that based upon my inspection of the code?

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Block level concurrency during recovery

From: Gregory Stark

Simon Riggs <simon@2ndQuadrant.com> writes:

> * re-create the Cleanup lock on blocks, when the original operation was
> performed while a Cleanup lock was held.
>
> So the plan is to introduce a new XLogLockBufferForCleanup() function
> and then use it in all places where a cleanup lock was held during the
> write of the WAL record.

I'm not sure I'm following what you're talking about. Are you saying the slave
will have to stop when it reaches a cleanup log entry and wait until it can
obtain the exclusive lock with no other pins? 

That sounds like the right thing though it could cause a long pause in WAL
recovery. Perhaps we should plan on being able to kill other processes holding
pins on the buffer just as we discussed killing transactions holding up
advancing the globalxmin.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's 24x7 Postgres support!


Re: Block level concurrency during recovery

From: Simon Riggs

On Mon, 2008-10-20 at 14:23 +0100, Gregory Stark wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> 
> > * re-create the Cleanup lock on blocks, when the original operation was
> > performed while a Cleanup lock was held.
> >
> > So the plan is to introduce a new XLogLockBufferForCleanup() function
> > and then use it in all places where a cleanup lock was held during the
> > write of the WAL record.
> 
> I'm not sure I'm following what you're talking about. Are you saying the slave
> will have to stop when it reaches a cleanup log entry and wait until it can
> obtain the exclusive lock with no other pins? 

Yes. 

> That sounds like the right thing though it could cause a long pause in WAL
> recovery. 

It could. 

> Perhaps we should plan on being able to kill other processes holding
> pins on the buffer just as we discussed killing transactions holding up
> advancing the globalxmin.

I'll copy my notes straight out of the Wiki:

We can't cancel a query that holds a pin on a particular block since we
don't keep track of who holds a pin. We just know a backend has a pin
and that the startup process must wait.

This may become a problem, or it may not. In most cases, backends hold
5-15 pins concurrently and pins are held for relatively short times.
Startup process will provide debug information for when pin times are
extended, and for monitoring total delay from pin waits.

Some strategies, if this becomes a problem:
     * minimise pin hold times wherever this occurs
     * deferring application of WAL for pinned blocks (requiring us to
       use conditional locking)
     * making VACUUM FREEZE hold stronger locks on standby to prevent
       query access (but doesn't help HOT)
     * ensuring that we perform read ahead I/O for WAL records that
       have not yet reached head of apply queue

Most people on-list were comfortable with the idea of recovery waiting
for queries to finish, up to a limit, re: discussion around
max_standby_delay. So waiting for cleanup locks is just another source
of potential wait time.

I think it will take a while for people to come up with some user
experience of whether this will be a pain or not.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Block level concurrency during recovery

From: Teodor Sigaev

> * For GIN indexes, we appear to not hold a Cleanup lock during
> vacuuming, except on root page. That stops new scans from starting, but
> it doesn't prevent progress of concurrent scans. Doesn't look correct to
> me... so not sure what strength lock to acquire in each case. Probably
Why do you think so?

> need to differentiate between WAL record types so we can tell which to
> acquire CleanupLock for and which not.
> 
> * GIST doesn't use CleanupLocks at all. So I'm very unclear here.
A scan process does not hold any pointer to a page now; an insertion process
does, but it tracks changes to the page by looking at the page's LSN.



-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: Block level concurrency during recovery

From: Simon Riggs

On Mon, 2008-10-20 at 18:28 +0400, Teodor Sigaev wrote:
> > * For GIN indexes, we appear to not hold a Cleanup lock during
> > vacuuming, except on root page. That stops new scans from starting, but
> > it doesn't prevent progress of concurrent scans. Doesn't look correct to
> > me... so not sure what strength lock to acquire in each case. Probably

> Why do you think so?

I have questions.

I don't understand why in ginVacuumPostingTreeLeaves() we lock only the
root page for Cleanup and no others. Why do we need to hold Cleanup lock
on the root? If the index is concurrent safe for existing scans, why is
it not safe for new scans also? And the converse: if it is not safe for
new scans, why is it safe for existing scans? 

> > need to differentiate between WAL record types so we can tell which to
> > acquire CleanupLock for and which not.
> > 
> > * GIST doesn't use CleanupLocks at all. So I'm very unclear here.

> A scan process does not hold any pointer to a page now; an insertion process
> does, but it tracks changes to the page by looking at the page's LSN.

Sounds good.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Block level concurrency during recovery

From: Teodor Sigaev

> I don't understand why in ginVacuumPostingTreeLeaves() we lock only the
> root page for Cleanup and no others. Why do we need to hold Cleanup lock
> on the root? If the index is concurrent safe for existing scans, why is
> it not safe for new scans also? And the converse: if it is not safe for
> new scans, why is it safe for existing scans? 

Because we wish to prevent inserts running concurrently with page deletion,
just to simplify the code. LockForCleanup guarantees that no insertion process
is at work here (an inserter keeps the root buffer pinned for the whole time
of the insertion). As a side effect, new scan processes can't start.

Note, in most cases this still allows enough concurrency, because all of this
concerns work on one tree within the GIN index. Usually there are a lot of
such trees in an index - one per lexeme, if we speak about a tsearch index.
So there is room for improvement, but I don't believe it would give a big
performance advantage in typical GIN usage.

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: Block level concurrency during recovery

From: Simon Riggs

On Mon, 2008-10-20 at 20:12 +0400, Teodor Sigaev wrote:
> > I don't understand why in ginVacuumPostingTreeLeaves() we lock only the
> > root page for Cleanup and no others. Why do we need to hold Cleanup lock
> > on the root? If the index is concurrent safe for existing scans, why is
> > it not safe for new scans also? And the converse: if it is not safe for
> > new scans, why is it safe for existing scans? 
> 
> Because we wish to prevent inserts running concurrently with page deletion,
> just to simplify the code. LockForCleanup guarantees that no insertion
> process is at work here (an inserter keeps the root buffer pinned for the
> whole time of the insertion). As a side effect, new scan processes can't
> start.

But does holding cleanup lock on root prevent an in-progress Insert from
changing non-root pages? I assume so, just not sure how.

> Note, in most cases this still allows enough concurrency, because all of
> this concerns work on one tree within the GIN index. Usually there are a lot
> of such trees in an index - one per lexeme, if we speak about a tsearch
> index. So there is room for improvement, but I don't believe it would give a
> big performance advantage in typical GIN usage.

I'm just worried about safety during Hot Standby, not trying to improve
anything.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Block level concurrency during recovery

From: Teodor Sigaev

> But does holding cleanup lock on root prevent an in-progress Insert from
> changing non-root pages? I assume so, just not sure how.

Yes, because the insertion process doesn't unpin the root page until the
insert is done.
-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: Block level concurrency during recovery

From: Simon Riggs

On Mon, 2008-10-20 at 13:54 +0100, Simon Riggs wrote:

> I'm looking at how to make queries safe during recovery, in the presence
> of concurrent changes to blocks. In particular, concurrent removal of
> rows that might be read by queries.
>
> My thinking is
> * we ignore LockRelationForExtension(). Normal queries never request it.
> All new blocks were created with that lock held and we are replaying
> changes serially, so we do not need to re-create that lock. We already
> do this, so no change.
> * re-create the Cleanup lock on blocks, when the original operation was
> performed while a Cleanup lock was held.
>
> So the plan is to introduce a new XLogLockBufferForCleanup() function
> and then use it in all places where a cleanup lock was held during the
> write of the WAL record.
>
> This means we will need to hold cleanup lock:
>
> * while RM_HEAP2_ID records are applied (Freeze, Clean, CleanMove)
>
> * while an XLOG_BTREE_DELETE was generated by VACUUM - this record type
> is not always generated by VACUUM. So split this WAL record into two
> types XLOG_BTREE_DELETE and XLOG_BTREE_VACUUM, so we can hold Cleanup
> lock while applying XLOG_BTREE_VACUUM. (This may not be required, but
> I'd rather do the full locking now and relax it later).
>
> * Whenever we apply a backup block that performs the same function as
> any of the above WAL records. So we would hold Cleanup lock when
> applying WAL records of types
>   all RM_HEAP2_ID types
>   XLOG_BTREE_VACUUM
>
> I'm most of the way through implementing the above and will post patch
> as a separate item to make it easier to inspect.

Here's the code to implement this portion.

I'm undecided as to whether this is too much code or too little. I'm
somewhat uncertain as to the locking requirements for the physical scan
during a vacuum. I've worked out various options if we need to change
this.

Also, the way I take a cleanup lock during RestoreBkpBlocks() seems very
ugly, but too much additional code seems over the top also.

I'm integrating into rest of patch now.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


Re: Block level concurrency during recovery

From: Heikki Linnakangas

Simon Riggs wrote:
> I'm undecided as to whether this is too much code or too little. I'm
> somewhat uncertain as to the locking requirements for the physical scan
> during a vacuum. I've worked out various options if we need to change
> this.

For the B-tree, an exclusive lock is enough to modify the contents of 
the page. A cleanup lock needs to be taken on every page to ensure that 
the vacuum doesn't finish and delete a heap tuple that's about to be 
accessed by an index scan. That could be handled in WAL replay by 
dropping the exclusive lock and re-acquiring the cleanup lock, but we 
can't do that for heap vacuums.
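
For illustration, that drop-and-reacquire step would amount to something
like this (sketch only, assuming buf is already pinned and
exclusive-locked by the redo routine):

    LockBuffer(buf, BUFFER_LOCK_UNLOCK); /* give up the exclusive lock */
    LockBufferForCleanup(buf);           /* wait to be the sole pin-holder */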

However, we require that in b-tree vacuum, you take a cleanup lock on 
*every* leaf page of the index, not only those that you modify. That's a 
problem, because there's no trace of such pages in the WAL.

PS. FSM doesn't need cleanup locks.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Block level concurrency during recovery

From: Simon Riggs

On Thu, 2008-10-23 at 09:09 +0300, Heikki Linnakangas wrote:

> However, we require that in b-tree vacuum, you take a cleanup lock on 
> *every* leaf page of the index, not only those that you modify. That's a 
> problem, because there's no trace of such pages in the WAL.

OK, good. Thanks for the second opinion. I'm glad you said that, cos I
felt sure anybody reading the patch would say "what the hell does this
bit do?". Now I can add it.

My solution is fairly simple:

As we pass through the table we keep track of which blocks need
visiting, then append that information onto the next WAL record. If the
last block doesn't contain removed rows, then we send a no-op message
saying which blocks to visit.

I'd already invented the XLOG_BTREE_VACUUM record, so now we just need
to augment it further with two fields: ordered array of blocks to visit,
and a doit flag.

Say we have a 10 block index, with rows to be removed on blocks 3,4,8.
As we visit all 10 in sequence we would issue WAL records:

XLOG_BTREE_VACUUM block 3 visitFirst {1, 2} doit = true
XLOG_BTREE_VACUUM block 4 visitFirst {} doit = true
XLOG_BTREE_VACUUM block 8 visitFirst {5,6,7} doit = true
XLOG_BTREE_VACUUM block 10 visitFirst {9} doit = false

So that allows us to issue the same number of WAL messages yet include
all the required information to repeat the process correctly.

(The blocks can be visited out of sequence in some cases, hence the
ordered array of blocks to visit rather than just a first block value).
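
A sketch of what the augmented WAL record might look like (field names
are illustrative only; the actual patch may differ):

    /*
     * Sketch: proposed layout for XLOG_BTREE_VACUUM.  Replay must take
     * the cleanup lock on each block in visitFirst[], in order, before
     * touching "block" itself.  If doit is false, the record removes
     * nothing and exists only to carry the lock-only visits.
     */
    typedef struct xl_btree_vacuum
    {
        RelFileNode node;           /* the index */
        BlockNumber block;          /* block holding tuples to remove */
        bool        doit;           /* any tuples to remove on "block"? */
        uint16      nvisit;         /* number of entries in visitFirst[] */
        BlockNumber visitFirst[1];  /* variable length */
        /* TARGET OFFSET NUMBERS FOLLOW AT END OF STRUCT */
    } xl_btree_vacuum;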

It would also be possible to introduce a special tweak there, which is
that if the block is not in cache, don't read it in at all. If it's not
in cache we know that nobody has a pin on it, so we don't need to read it
in just to say "got the lock". That's icing for later.
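
A sketch of the tweak, with ReadBufferIfCached() standing in for
whatever bufmgr support it would need (an invented helper; nothing like
it exists today):

    /*
     * Sketch: visit a block during replay without forcing a read.  A
     * block absent from shared buffers cannot be pinned by anyone, so
     * the cleanup lock is trivially "acquired" and the I/O is skipped.
     * ReadBufferIfCached() is invented: it returns InvalidBuffer when
     * the block is not already resident.
     */
    static void
    visit_block_for_cleanup(Relation rel, BlockNumber blkno)
    {
        Buffer  buf = ReadBufferIfCached(rel, blkno);   /* invented */

        if (BufferIsValid(buf))
        {
            LockBufferForCleanup(buf);
            UnlockReleaseBuffer(buf);
        }
        /* else: not in cache, hence unpinned -- nothing to wait for */
    }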

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Block level concurrency during recovery

From: Heikki Linnakangas

Simon Riggs wrote:
> It would also be possible to introduce a special tweak there, which is
> that if the block is not in cache, don't read it in at all. If it's not
> in cache we know that nobody has a pin on it, so we don't need to read it
> in just to say "got the lock". That's icing for later.

Heh, that's clever :-). We could actually use a method like that to 
solve the whole problem. After the (replay of the) b-tree vacuum is 
finished, scan through all shared buffers, and get a vacuum lock on 
every page of that index that's in cache. Groveling through all shared 
buffers would be slower for small indexes, though.
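
A hedged sketch of that sweep; PinBufferIfTagMatches() is invented here,
since the real pin machinery is private to bufmgr.c, so an actual
implementation would have to live there:

    /*
     * Sketch: after replaying the vacuum, take the cleanup lock on
     * every page of the index still resident in shared buffers.  Pages
     * not in cache cannot be pinned, so they need no visit at all.
     * PinBufferIfTagMatches() is invented: it pins and returns buffer
     * slot i iff that slot currently holds a page of the relation.
     */
    static void
    cleanup_lock_cached_index_pages(RelFileNode index_node)
    {
        int     i;

        for (i = 0; i < NBuffers; i++)
        {
            Buffer  buf = PinBufferIfTagMatches(i, index_node); /* invented */

            if (BufferIsValid(buf))
            {
                LockBufferForCleanup(buf);
                UnlockReleaseBuffer(buf);
            }
        }
    }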

I believe the "vacuum lock every leaf page" behavior is only needed for 
system catalogs. You have other issues with those still, namely that 
table locks are not yet taken appropriately, so I'm not sure if this is 
worth worrying about until that's done.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Block level concurrency during recovery

From: Simon Riggs

On Thu, 2008-10-23 at 19:21 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > It would also be possible to introduce a special tweak there, which is
> > that if the block is not in cache, don't read it in at all. If it's not
> > in cache we know that nobody has a pin on it, so we don't need to read it
> > in just to say "got the lock". That's icing for later.
> 
> Heh, that's clever :-). We could actually use a method like that to 
> solve the whole problem. After the (replay of the) b-tree vacuum is 
> finished, scan through all shared buffers, and get a vacuum lock on 
> every page of that index that's in cache. Groveling through all shared 
> buffers would be slower for small indexes, though.

Well, re-examining all of the assumptions in the code seems to have been
worthwhile so far. I think that makes 4 significant tweaks that have
come out of the Search For Hot Standby.

> I believe the "vacuum lock every leaf page" behavior is only needed for 
> system catalogs. 

So we will still need it then. Which is good 'cos I just wrote it.

> You have other issues with those still, namely that 
> table locks are not yet taken appropriately, so I'm not sure if this is 
> worth worrying about until that's done.

Please explain. If you know of a correctness issue, please say.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Block level concurrency during recovery

From: Heikki Linnakangas

Simon Riggs wrote:
> On Thu, 2008-10-23 at 19:21 +0300, Heikki Linnakangas wrote:
>> You have other issues with those still, namely that 
>> table locks are not yet taken appropriately, so I'm not sure if this is 
>> worth worrying about until that's done.
> 
> Please explain. If you know of a correctness issue, please say.

Things like DROP TABLE need to take an AccessExclusiveLock, and you 
don't have that yet, no?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Block level concurrency during recovery

From: Simon Riggs

On Thu, 2008-10-23 at 20:24 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Thu, 2008-10-23 at 19:21 +0300, Heikki Linnakangas wrote:
> >> You have other issues with those still, namely that 
> >> table locks are not yet taken appropriately, so I'm not sure if this is 
> >> worth worrying about until that's done.
> > 
> > Please explain. If you know of a correctness issue, please say.
> 
> Things like DROP TABLE need to take an AccessExclusiveLock, and you 
> don't have that yet, no?

Oh right, I thought you meant something not already in the plan.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Block level concurrency during recovery

From: Simon Riggs

On Thu, 2008-10-23 at 09:57 +0100, Simon Riggs wrote:
> On Thu, 2008-10-23 at 09:09 +0300, Heikki Linnakangas wrote:
> 
> > However, we require that in b-tree vacuum, you take a cleanup lock on 
> > *every* leaf page of the index, not only those that you modify. That's a 
> > problem, because there's no trace of such pages in the WAL.
> 
> OK, good. Thanks for the second opinion. I'm glad you said that, cos I
> felt sure anybody reading the patch would say "what the hell does this
> bit do?". Now I can add it.

Heikki,

When we discussed this before, I was glad that you'd mentioned that
aspect since it allowed me to say "if two of us think that then it must
be true".

I didn't include that in the final patch because it felt wrong. I didn't
have a rational explanation for that then, just a bad feeling. So, after
lots of sleep, here's my rational explanation of why we do *not* need
that during hot standby queries:

VACUUM with a btree index proceeds like this:
1. Scan table
2. Remove rows from btree identified in (1)
3. Remove rows from heap identified in (1)

The purpose of the additional locking requirements during (2) for btrees
is to ensure that we do not fail to find the rows identified in (1),
because the rows can move after (1) and during (2) because of block
splits. 

Requoting verbatim from the README: "The tricky part of this is to avoid
missing any deletable tuples in the presence of concurrent page splits:
a page split could easily move some tuples from a page not yet passed
over by the sequential scan to a lower-numbered page already passed
over." In recovery there are no concurrent page splits and the WAL
records represent already successfully identified deletable tuples.

On a standby server the rows will not move other than via WAL records.
So there is no possibility that a WAL record will fail to find the row
it was looking for. On the master we were looking for a tuple that
pointed to a htid, whereas in WAL replay we look directly at the index
tuple via its tid, not via the htid it points to. Therefore we do not
need the additional locking.

That seems logical to me, so I will leave that out.

Any alternative views?

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Block level concurrency during recovery

From: Tom Lane

Simon Riggs <simon@2ndQuadrant.com> writes:
> VACUUM with a btree index proceeds like this:
> 1. Scan table
> 2. Remove rows from btree identified in (1)
> 3. Remove rows from heap identified in (1)

> The purpose of the additional locking requirements during (2) for btrees
> is to ensure that we do not fail to find the rows identified in (1),
> because the rows can move after (1) and during (2) because of block
> splits. 

No, you are missing the point.  One purpose of the additional locking
requirements is to ensure that there is not a concurrent process that
has read a btree index entry just before you removed it but arrives at
the heap page only after you removed the heap entry (and, perhaps,
replaced it with some other row that doesn't match the index entry at
all).  This is clearly still a risk in a hot-standby environment.
        regards, tom lane


Re: Block level concurrency during recovery

From: Simon Riggs

On Mon, 2008-11-03 at 10:07 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > VACUUM with a btree index proceeds like this:
> > 1. Scan table
> > 2. Remove rows from btree identified in (1)
> > 3. Remove rows from heap identified in (1)
> 
> > The purpose of the additional locking requirements during (2) for btrees
> > is to ensure that we do not fail to find the rows identified in (1),
> > because the rows can move after (1) and during (2) because of block
> > splits. 
> 
> No, you are missing the point.  One purpose of the additional locking
> requirements is to ensure that there is not a concurrent process that
> has read a btree index entry just before you removed it but arrives at
> the heap page only after you removed the heap entry (and, perhaps,
> replaced it with some other row that doesn't match the index entry at
> all).  This is clearly still a risk in a hot-standby environment.

OK, I think I get it now. Thanks for putting me straight.

So I will implement the locking-every-page approach discussed upthread,
keeping note of the blocks touched, in exactly that order, and storing that
info onto the WAL records accordingly.

Are you happy with my optimisation that if a page needs to be read in,
we just skip it and pretend we did read-pin-unpin on it? I would
implement that as a new ReadBuffer mode (in Heikki's new API
terminology).

If you know/can see any other missing correctness requirements please
let me know. I've not had trouble understanding any of the other
correctness requirements, but I'll leave it to review to judge whether
I've implemented them all correctly.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Block level concurrency during recovery

From: Tom Lane

Simon Riggs <simon@2ndQuadrant.com> writes:
> Are you happy with my optimisation that if a page needs to be read in,
> we just skip it and pretend we did read-pin-unpin on it?

If it's not in buffers then it cannot be pinned, so I concur that that
is safe.
        regards, tom lane