Re: Block level concurrency during recovery - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: Block level concurrency during recovery
Msg-id 1224752254.27145.608.camel@ebony.2ndQuadrant
In response to Re: Block level concurrency during recovery  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
List pgsql-hackers
On Thu, 2008-10-23 at 09:09 +0300, Heikki Linnakangas wrote:

> However, we require that in b-tree vacuum, you take a cleanup lock on 
> *every* leaf page of the index, not only those that you modify. That's a 
> problem, because there's no trace of such pages in the WAL.

OK, good. Thanks for the second opinion. I'm glad you said that, cos I
felt sure anybody reading the patch would say "what the hell does this
bit do?". Now I can add it.

My solution is fairly simple:

As we pass through the table we keep track of which blocks need
visiting, then append that information onto the next WAL record. If the
last block doesn't contain removed rows, then we send a no-op message
saying which blocks to visit.

I'd already invented the XLOG_BTREE_VACUUM record, so now we just need
to augment it further with two fields: ordered array of blocks to visit,
and a doit flag.

Say we have a 10 block table, with rows to be removed on blocks 3,4,8. 
As we visit all 10 in sequence we would issue WAL records:

XLOG_BTREE_VACUUM block 3 visitFirst {1, 2} doit = true
XLOG_BTREE_VACUUM block 4 visitFirst {} doit = true
XLOG_BTREE_VACUUM block 8 visitFirst {5,6,7} doit = true
XLOG_BTREE_VACUUM block 10 visitFirst {9} doit = false

So that allows us to issue the same number of WAL messages yet include
all the required information to repeat the process correctly.

(The blocks can be visited out of sequence in some cases, hence the
ordered array of blocks to visit rather than just a first block value).
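The bookkeeping described above can be sketched in a few lines of Python. This is only an illustration, not PostgreSQL code: the function name `plan_vacuum_records` and the tuple layout `(block, visitFirst, doit)` are assumptions made for the example, and the visit order is passed in explicitly because blocks may be visited out of sequence.

```python
from typing import Iterable, List, Set, Tuple

# (target block, blocks to visit first, doit flag); illustrative layout only
VacuumRecord = Tuple[int, List[int], bool]


def plan_vacuum_records(visit_order: Iterable[int],
                        blocks_with_removals: Set[int]) -> List[VacuumRecord]:
    """Return the records a vacuum pass would emit under the scheme above."""
    records: List[VacuumRecord] = []
    pending: List[int] = []  # visited blocks not yet mentioned in any record

    for block in visit_order:
        if block in blocks_with_removals:
            # Piggyback the pending visit-only blocks on this real record.
            records.append((block, pending, True))
            pending = []
        else:
            pending.append(block)

    if pending:
        # Trailing blocks had nothing removed: emit a no-op record on the
        # last one, listing the others as blocks to visit first.
        last = pending.pop()
        records.append((last, pending, False))
    return records


for block, visit_first, doit in plan_vacuum_records(range(1, 11), {3, 4, 8}):
    print(block, visit_first, doit)
# 3 [1, 2] True
# 4 [] True
# 8 [5, 6, 7] True
# 10 [9] False
```

With the 10 block example this yields the same four records listed above, so the number of WAL records only grows by the one trailing no-op.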

It would also be possible to introduce a special tweak there: if the
block is not in cache, don't read it in at all. If it's not in cache we
know that nobody has a pin on it, so we don't need to read it in just to
say "got the lock". That's icing for later.
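A rough Python sketch of how replay of one such record might look with that tweak applied. Everything here is hypothetical: `replay_vacuum_record`, the action names, and modelling the buffer cache as a plain set of block numbers are assumptions for illustration. It also assumes that the target block of a doit = false record only needs visiting, so it can be skipped when uncached as well.

```python
from typing import List, Set, Tuple


def replay_vacuum_record(block: int, visit_first: List[int], doit: bool,
                         cached: Set[int]) -> List[Tuple[str, int]]:
    """Return the actions replay would take for one record (sketch only)."""
    actions: List[Tuple[str, int]] = []

    for blk in visit_first:
        if blk in cached:
            # Somebody may hold a pin, so we must wait for a cleanup lock.
            actions.append(("cleanup_lock", blk))
        else:
            # Not in cache means nobody has it pinned: nothing to wait for.
            actions.append(("skip_uncached", blk))

    if doit:
        # Rows are actually removed here, so the block is always read.
        actions.append(("cleanup_lock_and_remove", block))
    elif block in cached:
        actions.append(("cleanup_lock", block))
    else:
        actions.append(("skip_uncached", block))
    return actions
```

For example, replaying the record for block 8 with only block 6 in cache would lock block 6, skip blocks 5 and 7, and then lock and clean block 8.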

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


