Thread: Index use during Hot Standby
For Hot Standby I need to mark which indexes are usable or not. There are three
aspects to index use during recovery:

* Certain index types may not implement a fully concurrent locking order
  correctly, which would prevent that index type from being used during
  recovery.
* Other indexes might become unusable as indexAM code determines that
  particular indexes are corrupt.
* Any index type that is not WAL-logged will never be usable during recovery.

I've looked into the first of those concerns and can't find any reason to
believe index _redo code is dangerous for concurrent use. In particular, the
fact that some actions are split across multiple WAL records does not imply
that the individual parts of those actions are unsafe to run concurrently.
Please could people double-check that analysis?

Assuming that is OK, the second and third concerns need to be addressed.
Currently, the second concern is simply ignored. Should we continue to do
that?

We can handle the third concern in a couple of ways:

1. Implement WAL logging for hash indexes.

2. Implement an extra flag on pg_am that would be checked when we build the
relcache, so that all indexes of a certain type show as invalid during
recovery.

3. Implement an extra indexAM API call that allows the indexAM to decide
when/if an index is valid during recovery. This would also cover the second
concern neatly in a single API call.

I heard that people were working on (1). Are they going to make the deadline
or not? If not, I will have to do some further work on this. I suggest that I
wait until after the deadline to implement (2) or (3), in case somebody fixes
this up in the next few weeks.

I'm also happy to help out a little with adding WAL to hash indexes. In
general, every place that we do MarkBufferDirty() we (probably) need to write
some WAL entries.
MarkBufferDirty is rarely called directly (once...), but the next level of
code is called in these places:

hash.c
    hashbulkdelete()
hashovfl.c
    _hash_addovflpage()
    _hash_getovflpage()
    _hash_freeovflpage()  -- looks complex
    _hash_initbitmap()
    _hash_squeezebucket()
hashpage.c
    _hash_metapinit()
    _hash_splitbucket()
    _hash_expandtable()
hashinsert.c
    _hash_doinsert()

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
> 3. Implement an extra indexAM API call that allows indexAM to decide
> when/if index is valid during recovery. This would also cover the second
> concern neatly in a single API call.
>
> wait until after deadline to implement (2) or (3), in case somebody
> fixes this up in the next few weeks.

IMHO, without page locking in recovery mode, B-tree and GIN are not usable
while an incomplete split exists - there is a disconnected branch in the tree.

GiST has a similar issue - the incomplete insert. One insertion into a leaf
page can propagate key updates all the way up to the root, and page splits may
occur along the way. So gistxlog.c needs to track page splits in order to know
exactly when the index is unusable.

One more thing about GiST - when the database is switched from recovery mode
to normal mode, it is necessary to complete any pending insertions in GiST
and, possibly, vacuum the index. Dig around
GistBulkDeleteResult->needFullVacuum and gistContinueInsert().

-- 
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
On Mon, 2008-10-20 at 18:24 +0400, Teodor Sigaev wrote:
> > 3. Implement an extra indexAM API call that allows indexAM to decide
> > when/if index is valid during recovery. This would also cover the second
> > concern neatly in a single API call.
> >
> > wait until after deadline to implement (2) or (3), in case somebody
> > fixes this up in the next few weeks.
>
> IMHO, Without locking of pages in recovery mode Btree and GIN are not usable
> while incomplete split exists - there is a nonconnected branch in tree.

I think the two things are not necessarily related to each other. Incomplete
splits prevent us from using a checkpoint as a valid restartpoint for
recovery. That means an archive recovery always starts from a point where we
have all the WAL information required to complete any incomplete splits. That
has nothing to do with the locking requirements for concurrency.

Now you may be right that we do not have the correct locking for concurrency
in the splitting algorithms. But I haven't seen any issue yet myself. I will
look again; if you could be more specific, that would help me.

What I would say is that we need not consider the whole index to be unusable
at a point in time. If we emulate in recovery the lock sequence that was used
on the master, then we should have the same concurrency, so preventing access
to the whole index should be unnecessary.

Thinking some more on this, the database is not in a usable state until all
splits that were incomplete at the time recovery started have been completed.
This is because the first half of the split is already on disk, but we haven't
reacquired the locks we held while making the split in the first place. So we
do have at least one problem in this area.

> GiST has similar issue - incomplete insert. One insertion in leaf page can
> produce updating of keys up to the root. During that split pages may occurs.
> So, it's needed to add to gistxlog.c tracking of pages split to get exact
> knowledge about moments of unusability of index.
>
> One more thing about GiST - when database is switched from recovery mode to the
> normal mode then it's needed to complete insertion in GiST and, possibly, vacuum
> index. Dig around GistBulkDeleteResult->needFullVacuum and gistContinueInsert()

OK, will look at those. (Problems++). Thanks,

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
>> One more thing about GiST - when database is switched from recovery mode to the
>> normal mode then it's needed to complete insertion in GiST and, possibly, vacuum
>> index. Dig around GistBulkDeleteResult->needFullVacuum and gistContinueInsert()
>
> OK, will look at those. (Problems++).

Not a very big issue: in this case the index is still in a workable state,
just not very efficient until it is vacuumed.

-- 
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
On Mon, 2008-10-20 at 16:11 +0100, Simon Riggs wrote:
> On Mon, 2008-10-20 at 18:24 +0400, Teodor Sigaev wrote:
> > > 3. Implement an extra indexAM API call that allows indexAM to decide
> > > when/if index is valid during recovery. This would also cover the second
> > > concern neatly in a single API call.
> > >
> > > wait until after deadline to implement (2) or (3), in case somebody
> > > fixes this up in the next few weeks.
> >
> > IMHO, Without locking of pages in recovery mode Btree and GIN are not usable
> > while incomplete split exists - there is a nonconnected branch in tree.
>
> Now you may be right, that we do not have the correct locking for
> concurrency in the splitting algorithms. But I haven't seen any issue
> yet myself. But I will look again.

OK, I think I've found a problem.

In _bt_insertonpg(), if we split we do _bt_split() and then
_bt_insert_parent(), which then does _bt_insertonpg() recursively.

_bt_split() writes a WAL record but continues holding write locks.
btree_xlog_split() reads that WAL record and does *not* continue to hold write
locks. So recovery locking differs from the Lehman & Yao requirements at that
point.

So to make this concurrent-safe, we must keep holding the locks in
btree_xlog_split() and record the details as part of the log_incomplete_split
process. That means skipping UnlockReleaseBuffer() at the end of
btree_xlog_split(); those buffers would remain pinned and locked. If
btree_xlog_insert() is then called on a non-leaf node, we would get the
details of the blocks still holding write locks from the incomplete-split
data. Those locks could be held for a while if the incomplete-split process is
screwed, but it would make this concurrent-safe.

I guess we'd use the same technique for GIN. ginInsertValue()?? Hmm, you
release the lock at line 412 of ginbtree.c before you get the parent lock at
line 428. That seems different to the L&Y interactions. Am I looking in the
wrong place?
-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
> OK, I think I've found a problem.
>
> In _bt_insertonpg(), if we split we do _bt_split() then do
> _bt_insert_parent(), which then does _bt_insertonpg() recursively.
>
> _bt_split() writes a WAL record but continues holding write locks.
> btree_xlog_split() reads WAL record and does *not* continue to hold
> write locks. So recovery locking differs from Lehman & Yao requirements
> at that point.

Hmm. I don't have Lehman & Yao's paper at hand, but I fail to see what would
go wrong. Recovery of a split works like this:

1. Reconstruct the new right sibling from scratch. Keep it locked.
2. Update the old page (= the new left sibling). Keep it locked.
3. Release the locks on both pages.
4. Update the left-link of the page to the right of the new right sibling.

Searches descending work just fine without the pointer in the parent page to
the new right sibling, just slower, because they will always land on the left
sibling and might have to move right from there. Searchers moving from left to
right work fine; they will see either the old page, or both the new left and
right siblings. Searchers moving from right to left will likewise work; they
will see either the old page, or the new right and then left page, or, between
steps 3 and 4, they will move to the left page, see that its right-link
doesn't point to the page they came from, and move right to the new right
sibling.

All of that works just like during normal operation, so I don't actually
understand why L&Y requires that you keep the split pages locked until you've
locked the parent. Maybe that's needed to handle concurrent inserts or splits,
but there can't be any during WAL replay.

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Mon, 2008-10-20 at 21:11 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > OK, I think I've found a problem.
> >
> > In _bt_insertonpg(), if we split we do _bt_split() then do
> > _bt_insert_parent(), which then does _bt_insertonpg() recursively.
> >
> > _bt_split() writes a WAL record but continues holding write locks.
> > btree_xlog_split() reads WAL record and does *not* continue to hold
> > write locks. So recovery locking differs from Lehman & Yao requirements
> > at that point.
>
> Hmm. I don't have Lehman & Yao's paper at hand, but I fail to see what
> would go wrong.
>
> Recovery of a split works like this:
>
> 1. Reconstruct new right sibling from scratch. Keep locked
> 2. Update old page (= new left sibling). Keep locked
> 3. Release locks on both pages.
> 4. Update the left-link of the page to the right of the new right sibling.
>
> Searches descending work just fine without the pointer in the parent
> page to the new right sibling, just slower because they will always land
> on the left sibling, and might have move right from there. Searchers
> moving from left to right work fine; they will see either the old page,
> or both the new left and right sibling. Searchers moving right to left
> will likewise work; they will see either the old page, or the new right,
> then left page, or between steps 3 and 4, they will move to the left
> page, see that the right-link doesn't point to the page it came from,
> and move right to the new right sibling.
>
> All that works just like during normal operation, so I don't actually
> understand why L&Y requires that you keep the split pages locked until
> you've locked the parent. Maybe it's needed to handle concurrent inserts
> or splits, but there can't be any during WAL replay.

I think you're right to question that. I was happy to say "locking must be
identical", which is correct, but the assumptions are different in recovery,
as you point out.
The details you cite are not as important as the realisation that we need to
get concurrency correct from the perspective of only a single inserter and
many readers. I'd overlooked that basic assumption, but it's important we
state it clearly: "Recovery operations are serialized, and therefore recovery
operations need not fully emulate the locking required for multiple concurrent
writers."

Grokking the paper suggests to me that you are correct, and that the double
locking can only be required for correct ordering of page-split operations. It
clearly isn't needed at all for the correctness proof with a single inserter
on p.350. So updating the blocks for a page split on any one level is
performed by a single WAL record and is therefore atomic.

None of the things I worried about earlier need addressing, so we're back to
where I was this morning: no changes required for correct concurrency
behaviour. Thanks everybody for a valuable discussion.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
> I guess we'd use the same technique for GIN. ginInsertValue() ??
> Hmm, you release the lock at line 412, ginbtree.c before you get the
> parent lock at line 428. That seems different to the L&Y interactions.
> Am I looking in the wrong place?

At line 412 the new page (the right page) is unlocked; the old page (the left
one) is unlocked later, at line 448, after the parent page has been locked.

-- 
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
On Tue, 2008-10-21 at 14:11 +0400, Teodor Sigaev wrote:
> > I guess we'd use the same technique for GIN. ginInsertValue() ??
> > Hmm, you release the lock at line 412, ginbtree.c before you get the
> > parent lock at line 428. That seems different to the L&Y interactions.
> > Am I looking in the wrong place?
>
> at line 412 new page (right page) is unlocked, old page (left one) is unlocked
> later - at line 448, after parent page is locked.

Thanks for checking.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support