Thread: Ответ: WAL and indexes (Re: [HACKERS] WAL status & todo)

Ответ: WAL and indexes (Re: [HACKERS] WAL status & todo)

From
"Mikheev, Vadim"
Date:
>I don't understand why WAL needs to log internal operations of any of
>the index types.  Seems to me that you could treat indexes as black
>boxes that are updated as side effects of WAL log items for heap tuples:
>when adding a heap tuple as a result of a WAL item, you just call the
>usual index insert routines, and when deleting a heap tuple as a result

During recovery the backend *can't* use any of the
usual routines: system catalogs are not available.

>of undoing a WAL item, you mark the tuple invalid but don't physically
>remove it till VACUUM (thus no need to worry about its index entries).

One of the purposes of WAL is immediate removal of
tuples inserted by aborted xactions. I want to make
VACUUM *optional* in the future - space must be
available for reuse without VACUUM. And this is the
first, very small, step in this direction.

>This doesn't address the issue of recovering from an incomplete index
>update (such as a partially-completed btree page split), but I think
>the most reliable way to do that is to add WAL records on the order of
>"update beginning for index X" and "update done for index X".  If you
>see the begin and not the done record when replaying a log, you assume

You will still have to log changes for *each* page
updated on behalf of an index operation! The fact that
you've seen begin/end records in the log doesn't mean
that all intermediate changes to index pages were
written to the index file, unless you've logged all
these changes and see all of them in the index on
recovery.
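
A minimal sketch of the per-page logging argument above, in
Python. It assumes each page stores the LSN of the last log
record applied to it, which is a common WAL convention and not
something stated in this message. Page, PageRecord and redo are
hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    lsn: int = 0                               # LSN of the last record applied here
    items: list = field(default_factory=list)  # stand-in for index items on the page


@dataclass
class PageRecord:
    lsn: int       # position of this record in the log
    page_no: int   # which index page was changed
    item: str      # the change itself (simplified to one appended item)


def redo(disk, wal):
    """Reapply every logged page change the on-disk page has not seen."""
    applied = []
    for rec in wal:
        page = disk.setdefault(rec.page_no, Page())
        if page.lsn < rec.lsn:        # this change never reached the index file
            page.items.append(rec.item)
            page.lsn = rec.lsn
            applied.append(rec.lsn)
    return applied


# One index operation touched three pages and logged each change.
wal = [PageRecord(1, 10, "k1"), PageRecord(2, 11, "k2"), PageRecord(3, 12, "k3")]

# At crash time only page 10 had been written to the index file.
disk = {10: Page(lsn=1, items=["k1"])}

replayed = redo(disk, wal)   # records 2 and 3 are reapplied
```

With only begin/end markers in the log, recovery would have no
way to tell that pages 11 and 12 were never written.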

>the index is corrupt and rebuild it from scratch, using Hiroshi's
>index-rebuild code.

How fast is rebuilding an index for a table with
10^7 records?
I agree to consider rtree/hash/gist as experimental
index access methods BUT we have to have at least
*one* reliable index AM with short down time/
fast recovery.

>For that matter I am far from convinced that the currently committed
>code for btree WAL logging is correct --- where does it cope with
>cleaning up after an unfinished page split?  I don't see it.

What do you mean? If you are talking about updating
the parent page ("my bits moved ..." etc) then, as I've
mentioned previously, we can handle an uninserted
parent item at run time (though it's not implemented
yet -:)). WAL allows us to restore both left and right
siblings, and this is the most critical split issue.
(BTW, note that with all btitems in place on the leaf
level we could do REINDEX veeeeery fast).
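
A rough sketch of how a missing parent item might be tolerated
at run time, in Python. It assumes leaf pages carry a high key
and a right-sibling link (the Lehman-Yao style); the message
above says this handling is not implemented yet, so this is an
illustration only. Leaf and find_leaf are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Leaf:
    keys: list
    high_key: Optional[int] = None   # upper bound of this page; None = rightmost page
    right: Optional["Leaf"] = None   # link to the right sibling


def find_leaf(start, key):
    """Follow right links while the key lies beyond the page's high key."""
    page = start
    while page.high_key is not None and key > page.high_key:
        page = page.right
    return page


# State after a crash: both siblings were restored from the log,
# but the parent still points only at the left page.
right = Leaf(keys=[30, 40])
left = Leaf(keys=[10, 20], high_key=20, right=right)

found = find_leaf(left, 30)   # descends to `left`, then moves right
```

The search still reaches the right sibling, so the parent item
can be inserted later.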

>Since we have very poor testing capabilities for the non-mainstream
>index types (remember how I broke rtree completely during 6.5 devel,
>and no one noticed till quite late in beta?) I will have absolutely
>zero confidence in WAL support for these index types if it's implemented
>this way.  I think we should go with a black-box approach that's the
>same for all index types and is implemented completely outside the
>index-access-method-specific code.

I agree with this approach for all indices except
btree (above + "hey, something is already done for
them" -:)). But remember that to know whether an index
is consistent or not we have to log *all* changes made
in the index file anyway... so it seems we have to be
very close to AM-specific -:)

Vadim



Re: Otvet: WAL and indexes (Re: [HACKERS] WAL status & todo)

From
Alfred Perlstein
Date:
* Mikheev, Vadim <vmikheev@SECTORBASE.COM> [001016 09:33] wrote:
> >I don't understand why WAL needs to log internal operations of any of
> >the index types.  Seems to me that you could treat indexes as black
> >boxes that are updated as side effects of WAL log items for heap tuples:
> >when adding a heap tuple as a result of a WAL item, you just call the
> >usual index insert routines, and when deleting a heap tuple as a result
> 
> On recovery backend *can't* use any usual routines:
> system catalogs are not available.
> 
> >of undoing a WAL item, you mark the tuple invalid but don't physically
> >remove it till VACUUM (thus no need to worry about its index entries).
> 
> One of the purposes of WAL is immediate removing tuples 
> inserted by aborted xactions. I want make VACUUM
> *optional* in future - space must be available for
> reusing without VACUUM. And this is first, very small,
> step in this direction.

Why would vacuum become optional?  Would WAL offer an option to
not reclaim free space?  We're hoping that vacuum becomes unneeded
when postgresql is run with some flag indicating that we're
uninterested in time travel.

How much longer do you estimate until you can make it work that way?

thanks,
-Alfred


Re: Ответ: WAL and indexes (Re: [HACKERS] WAL status & todo)

From
Tom Lane
Date:
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
>> I don't understand why WAL needs to log internal operations of any of
>> the index types.  Seems to me that you could treat indexes as black
>> boxes that are updated as side effects of WAL log items for heap tuples:
>> when adding a heap tuple as a result of a WAL item, you just call the
>> usual index insert routines, and when deleting a heap tuple as a result

> On recovery backend *can't* use any usual routines:
> system catalogs are not available.

OK, good point, but that just means you can't use the catalogs to
discover what indexes exist for a given table.  You could still create
log entries that look like "insert indextuple X into index Y" without
any further detail.
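
A minimal sketch of such log entries, in Python. It assumes the
record can name the index by some physical identifier, so replay
needs no catalog lookup; IndexInsert, index_id and replay are
hypothetical names, and a sorted list stands in for the index.

```python
from bisect import insort
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexInsert:
    index_id: int     # hypothetical physical identifier of index Y
    key: int          # the indexed value of indextuple X
    heap_tid: tuple   # (block, offset) of the heap tuple it points to


def replay(records, indexes):
    """Apply "insert indextuple X into index Y" records in log order."""
    for rec in records:
        entries = indexes.setdefault(rec.index_id, [])
        entry = (rec.key, rec.heap_tid)
        if entry not in entries:      # keep replay idempotent
            insort(entries, entry)    # stand-in for the usual insert routine
    return indexes


log = [
    IndexInsert(index_id=7, key=5, heap_tid=(0, 1)),
    IndexInsert(index_id=7, key=3, heap_tid=(0, 2)),
    IndexInsert(index_id=9, key=1, heap_tid=(4, 1)),
]

indexes = replay(log, {})
```

Replaying the same log a second time leaves the indexes
unchanged, which matters if recovery itself is interrupted.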

>> the index is corrupt and rebuild it from scratch, using Hiroshi's
>> index-rebuild code.

> How fast is rebuilding of index for table with 10^7 records?

It's not fast, of course.  But the point is that you should seldom
have to do it.

> I agree to consider rtree/hash/gist as experimental
> index access methods BUT we have to have at least
> *one* reliable index AM with short down time/
> fast recovery.

With all due respect, I wonder just how "reliable" btree WAL undo/redo
will prove to be ... let alone the other index types.  I worry that
this approach is putting too much emphasis on making it fast, and not
enough on making it right.

            regards, tom lane


Re: Re: Otvet: WAL and indexes (Re: [HACKERS] WAL status & todo)

From
Alfred Perlstein
Date:
* Tom Lane <tgl@sss.pgh.pa.us> [001016 09:47] wrote:
> "Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
> >> I don't understand why WAL needs to log internal operations of any of
> >> the index types.  Seems to me that you could treat indexes as black
> >> boxes that are updated as side effects of WAL log items for heap tuples:
> >> when adding a heap tuple as a result of a WAL item, you just call the
> >> usual index insert routines, and when deleting a heap tuple as a result
> 
> > On recovery backend *can't* use any usual routines:
> > system catalogs are not available.
> 
> OK, good point, but that just means you can't use the catalogs to
> discover what indexes exist for a given table.  You could still create
> log entries that look like "insert indextuple X into index Y" without
> any further detail.

One thing you guys may wish to consider is selectively fsyncing on
system catalogs and marking them dirty when opened for write:

postgres:  i need to write to a critical table...
opens table, marks dirty
completes operation and marks undirty and fsync

-or-

postgres:  i need to write to a critical table...
opens table, marks dirty
crash, burn, smoke (whatever)

Now you may still have the system tables broken, however the chances
of that may be significantly reduced depending on how often writes
must be done to them.

It's a hack, but depending on the amount of writes done to critical
tables it may reduce the window for these inconvenient situations
significantly.
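
A minimal sketch of the two sequences above, in Python. It
assumes the dirty mark is kept as a separate marker file next to
the table; the ".dirty" suffix, write_critical and needs_repair
are hypothetical names, and the sketch skips fsyncing the
directory that holds the marker.

```python
import os
import tempfile


def write_critical(table_path, data, crash=False):
    marker = table_path + ".dirty"
    # Mark the table dirty, and make the mark durable, before touching it.
    with open(marker, "w") as m:
        m.flush()
        os.fsync(m.fileno())
    with open(table_path, "ab") as f:
        f.write(data)
        if crash:
            raise RuntimeError("simulated crash")   # marker is left behind
        f.flush()
        os.fsync(f.fileno())                        # table contents are on disk
    os.remove(marker)                               # mark undirty


def needs_repair(table_path):
    """A leftover marker means a write to this table never completed."""
    return os.path.exists(table_path + ".dirty")


workdir = tempfile.mkdtemp()
table = os.path.join(workdir, "critical_table")

write_critical(table, b"row1\n")
clean_after_ok = needs_repair(table)       # completed write: no marker

try:
    write_critical(table, b"row2\n", crash=True)
except RuntimeError:
    pass
dirty_after_crash = needs_repair(table)    # interrupted write: marker remains
```

On startup only the tables that still have a marker would need
to be checked.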

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."