Re: Fractal tree indexing - Mailing list pgsql-hackers

From Greg Stark
Subject Re: Fractal tree indexing
Date
Msg-id CAM-w4HOvnPuOiyWky3dtuYhXW_K+Un4VsLhWPE0=vdXOJ9+tbg@mail.gmail.com
Whole thread Raw
In response to Re: Fractal tree indexing  (Alexander Korotkov <aekorotkov@gmail.com>)
List pgsql-hackers
On Wed, Feb 13, 2013 at 1:31 PM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> On Wed, Feb 13, 2013 at 4:51 PM, Greg Stark <stark@mit.edu> wrote:
>>
>> Heikki was talking about a generic WAL record type that would just
>> store a binary delta between the version of the block when it was
>> locked and when it was unlocked. That would handle any extension
>> cleanly as far as data modification goes as long as the extension was
>> working through our buffer manager. It seems like an attractive idea
>> to me.
>
>
> It will, for sure, works well when atomic page changes are enough for us.
> However, some operations, for example, page splits, contain changes in
> multiple pages. Replaying changes in only some of pages is not fair. Now,
> it's hard for me to imagine how to generalize it into generic WAL record
> type.

I think multiple-page operations where you have all the pages locked
at the same time, like tuple updates for example, are fairly simple.
The existing WAL format can handle at most three such buffers in a
single record so we can just have a fixed size buffer large enough to
hold three buffers and perform the diff on the three when the
extension says it has completed an atomic update. The simplest API
which I suspect would suffice for virtually all users would be to tie
this to buffer locking and unlocking.

The types of records which this would not suffice for would be

a) things that need extra book-keeping during recovery in case some
later record was not recorded. We have only a few such records -- page
splits as you say -- and hopefully extensions wouldn't need them
because frankly they're pretty scary.

b) records that are needed not just to maintain consistency of the
data in the database but to provide some other behaviour -- for
instance the WAL records for locks that 2PC needs or the snapshot
records that hot standby needs. But then these types of records might
be more amenable to an extensible WAL format. Since the loss of them
wouldn't leave the database corrupted, just prevent that feature from
operating correctly.

-- 
greg



pgsql-hackers by date:

Previous
From: Fabrízio de Royes Mello
Date:
Subject: Re: proposal or just idea for psql - show first N rows from relation backslash statement
Next
From: Noah Misch
Date:
Subject: Re: BUG #7493: Postmaster messages unreadable in a Windows console