Thread: Gist Recovery testing

Gist Recovery testing

From
Teodor Sigaev
Date:
Ok, I have already implemented WAL in GiST and tested it with online backup.
It works.  Now I need to test recovery after a crash (power loss etc.), and
there is a problem in GiST with incomplete inserts.  I hope I have resolved
it, but this problem, I think, has the same nature as incomplete splits in
the btree code.  To solve that, the btree xlog code keeps a stack of
incomplete splits, and if any are left incomplete, the btree_xlog_cleanup
function completes them.  I use a similar idea, but I can't reproduce a
situation in which xlog_cleanup does any useful work beyond looking at an
empty stack.  So the question is: how can the desired behaviour be reproduced?
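
The bookkeeping described above can be sketched as a toy model in plain C (this is illustrative only, not PostgreSQL code; all names are made up): redo pushes an entry when it sees the first record of a split, pops it when the matching completion record arrives, and whatever is still on the stack at end-of-WAL is what the cleanup function must finish.

```c
/* Toy sketch of "stack of incomplete splits" bookkeeping.
 * Not PostgreSQL code; names are illustrative. */
#include <assert.h>

#define MAX_PENDING 64

static unsigned pending[MAX_PENDING];   /* block numbers of unfinished splits */
static int      npending = 0;

/* first WAL record of a split: remember it as incomplete */
void redo_split_start(unsigned blkno)
{
    assert(npending < MAX_PENDING);
    pending[npending++] = blkno;
}

/* matching completion record: forget it */
void redo_split_done(unsigned blkno)
{
    for (int i = 0; i < npending; i++)
        if (pending[i] == blkno)
        {
            pending[i] = pending[--npending];   /* swap-remove */
            return;
        }
}

/* called once at end of recovery; returns how many splits it had to finish.
 * In the common case the stack is empty and this does nothing -- which is
 * exactly why the interesting path is hard to reproduce in testing. */
int xlog_cleanup(void)
{
    int n = npending;
    npending = 0;               /* "complete" each leftover split here */
    return n;
}
```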

I tried kill(getpid(), SIGKILL) in the middle of an insertion (inside
gist.c), but postgres doesn't try to restore the aborted transaction...

Thank you.
-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: Gist Recovery testing

From
Tom Lane
Date:
Teodor Sigaev <teodor@sigaev.ru> writes:
> Ok, I have already implemented WAL in GiST and tested it with online backup.
> It works.  Now I need to test recovery after a crash (power loss etc.), and
> there is a problem in GiST with incomplete inserts.  I hope I have resolved
> it, but this problem, I think, has the same nature as incomplete splits in
> the btree code.  To solve that, the btree xlog code keeps a stack of
> incomplete splits, and if any are left incomplete, the btree_xlog_cleanup
> function completes them.  I use a similar idea, but I can't reproduce a
> situation in which xlog_cleanup does any useful work beyond looking at an
> empty stack.  So the question is: how can the desired behaviour be reproduced?

It's not very easy; you have to arrange for the WAL history to end
between the two action records that you are interested in.  What I'd
suggest is
    - run test backend under GDB, set breakpoint between
      making log entry for first split step and entry for second.
    - let it reach the breakpoint
    - in another backend, commit some unrelated transaction to
      force XLOG buffers to disk
    - in GDB, "kill -9" to prevent test backend from proceeding
 

If you want to trace through the recovery as well, the best bet might be
to temporarily add a "sleep(30)" near the beginning of StartupXlog so
that you have time to reattach to the recovery process.
        regards, tom lane


Re: Gist Recovery testing

From
Oleg Bartunov
Date:
There is a problem with the gist_xlog_cleanup function - we have no
information about user-defined functions and indexes, because the postmaster
hasn't started yet and the system catalog isn't available.  For btree this
isn't necessary, but for GiST we certainly need it.  Would it be possible to
add an additional hook after the cleanup stage, so that we could use it to
complete incomplete inserts?  Any suggestions?
    Oleg
On Fri, 10 Jun 2005, Tom Lane wrote:

> Teodor Sigaev <teodor@sigaev.ru> writes:
>> Ok, I have already implemented WAL in GiST and tested it with online backup.
>> It works.  Now I need to test recovery after a crash (power loss etc.), and
>> there is a problem in GiST with incomplete inserts.  I hope I have resolved
>> it, but this problem, I think, has the same nature as incomplete splits in
>> the btree code.  To solve that, the btree xlog code keeps a stack of
>> incomplete splits, and if any are left incomplete, the btree_xlog_cleanup
>> function completes them.  I use a similar idea, but I can't reproduce a
>> situation in which xlog_cleanup does any useful work beyond looking at an
>> empty stack.  So the question is: how can the desired behaviour be reproduced?
>
> It's not very easy; you have to arrange for the WAL history to end
> between the two action records that you are interested in.  What I'd
> suggest is
>     - run test backend under GDB, set breakpoint between
>       making log entry for first split step and entry for second.
>     - let it reach the breakpoint
>     - in another backend, commit some unrelated transaction to
>       force XLOG buffers to disk
>     - in GDB, "kill -9" to prevent test backend from proceeding
>
> If you want to trace through the recovery as well, the best bet might be
> to temporarily add a "sleep(30)" near the beginning of StartupXlog so
> that you have time to reattach to the recovery process.
>
>             regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 8: explain analyze is your friend
>
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


Re: Gist Recovery testing

From
Tom Lane
Date:
Oleg Bartunov <oleg@sai.msu.su> writes:
> There is a problem with the gist_xlog_cleanup function - we have no
> information about user-defined functions and indexes, because the
> postmaster hasn't started yet and the system catalog isn't available.
> For btree this isn't necessary, but for GiST we certainly need it.

btree manages to avoid calling the index support functions during WAL
restore --- can't you do the same?  Perhaps you need to be including
more information in the initial xlog records, so that the cleanup step
has everything it needs.  Or resort to brute-force search (which is more
or less what btree does).  I don't think this operation needs to be very
efficient, since it's a corner case that should only seldom be invoked.

It's a really bad idea to be calling any support functions during WAL
recovery, since you won't have a transaction open for them to run in...
        regards, tom lane


Re: Gist Recovery testing

From
Teodor Sigaev
Date:
> btree manages to avoid calling the index support functions during WAL
> restore --- can't you do the same?  Perhaps you need to be including
> more information in the initial xlog records, so that the cleanup step
> has everything it needs.  Or resort to brute-force search (which is more
> or less what btree does).  I don't think this operation needs to be very
> efficient, since it's a corner case that should only seldom be invoked.

I've just committed WAL logging for GiST.  It works for online backup (normal
recovery) and mostly after a crash, but in that case it can't restore
incomplete inserts, although it can detect them and write to the log that the
GiST index needs to be reindexed.

The problem with incomplete inserts is this: when a new entry is installed on
a leaf page, in the worst case the whole chain of keys from the root page
down to the leaf must be updated.  Of course, updating the whole chain is
rare; usually only part of it is updated.  Each key on an inner page contains
the union of the keys (a bounding box in the case of rtree, for example) on
the page it points to.  This union can be formed only with the help of a
user-defined function of the opclass, because GiST knows nothing about the
nature of the keys.  Returning to WAL: the GiST core writes an xlog entry
with all the information necessary for restoration before writing the page,
but at that moment it doesn't know whether the keys on the parent page will
need updating or will stay unchanged.  So GiST's WAL restoration code must
remember this page's update as an incomplete insert.  When the insert
completes, the GiST core writes to the log that the insert is complete, and
the restore code can clean up the stored incomplete insert.  After a crash,
the completion record may be absent from the log, and GiST's restore code
should then continue the insert.  While continuing, it knows which page was
changed and must walk up to the root.  On each step of the walk it must form
the union for the page and insert it into the parent.
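
The completion walk can be sketched as a toy model in C, using a 1-D bounding interval in place of a real opclass union (all types and names here are illustrative, not PostgreSQL's): after the leaf was updated but the crash lost the parent updates, we walk from the leaf up to the root, widening each ancestor's union key.

```c
/* Toy model of completing an interrupted GiST insert.
 * A "union key" here is just a 1-D bounding interval; in real GiST it is
 * whatever the opclass union function produces (e.g. a bounding box). */
#include <assert.h>
#include <stddef.h>

typedef struct Node {
    struct Node *parent;         /* NULL for the root */
    int lo, hi;                  /* the union key: a bounding interval */
} Node;

/* stand-in for the user-defined opclass union function */
static void widen(Node *n, int lo, int hi)
{
    if (lo < n->lo) n->lo = lo;
    if (hi > n->hi) n->hi = hi;
}

/* continue an incomplete insert: propagate the leaf's (possibly grown)
 * key up the chain so every ancestor's union covers it again */
void complete_insert(Node *leaf)
{
    for (Node *p = leaf->parent; p != NULL; p = p->parent)
        widen(p, leaf->lo, leaf->hi);
}
```

The point of the sketch is that `widen` stands for an opclass callback, which is exactly what is unavailable during WAL replay before the system catalog can be read.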

This problem is absent in btree, because btree is sorted: the parent's key is
right there on the child page, so the xlog code doesn't need to call any
support function.


The ways I see:

1. Leave it as is.  Incomplete inserts are not a frequent case, and the code
can detect them and suggest a reindex.

2. In principle, it's possible to collect all information about the changes
an insert makes to the tree in a single WAL record.  But that isn't a good
solution: first, it requires ugly coding of inserts (collect all pages and
write them only after pushing the WAL record); second, it breaks using the
LSN for page-level concurrency control.  But it would fully solve the
problem of incomplete inserts.

3. Have XLogOpenRelation construct a more complete relation than it does now.
In fact, the restore code needs only two things about the index:
index->rd_att and the information needed for index_getprocinfo to work.  I do
understand why calling any support functions during WAL recovery is a bad
idea.

4. Add a hook to finalize the WAL code.  This hook would be called once the
system catalog is made accessible.

5. There is one unused bit in IndexTupleData.  The GiST code can use it to
mark that a tuple contains an incorrect union key.  In this case, the GiST
search algorithm must keep in mind that such a tuple has an incorrect value
and always descend to its child to check.  During insertions into the tree,
such tuples will little by little be changed to correct ones.  The defect of
this scheme is possible performance degradation after crash recovery.  On the
other hand, it's possible to add recreation of such tuples to the vacuum code
(in gistbulkdelete).





-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: Gist Recovery testing

From
Tom Lane
Date:
Teodor Sigaev <teodor@sigaev.ru> writes:
> ... The ways I see:
> 5. There is one unused bit in IndexTupleData.  The GiST code can use it
> to mark that a tuple contains an incorrect union key.  In this case, the
> GiST search algorithm must keep in mind that such a tuple has an
> incorrect value and always descend to its child to check.  During
> insertions into the tree, such tuples will little by little be changed
> to correct ones.  The defect of this scheme is possible performance
> degradation after crash recovery.  On the other hand, it's possible to
> add recreation of such tuples to the vacuum code (in gistbulkdelete).

This seems like a good answer --- if you fix up such things during
vacuum then the performance issue won't last too long.

Rather than eating up the extra flag bit, though, would it be possible
to change the tuple to appear to contain a NULL?  AFAIK, GiST can't have
a NULL (at least not in the first column) so this should be an unused
state.
        regards, tom lane


Re: Gist Recovery testing

From
Teodor Sigaev
Date:
> 
> This seems like a good answer --- if you fix up such things during
> vacuum then the performance issue won't last too long.
> 
> Rather than eating up the extra flag bit, though, would it be possible
> to change the tuple to appear to contain a NULL?  AFAIK, GiST can't have
> a NULL (at least not in the first column) so this should be an unused
> state.

We would like to preserve NULL as a non-special value because we hope to add
support for NULLs to GiST, although it's of low priority.  On the other hand,
IndexTupleData contains an ItemPointerData, which holds a block number and an
offset.  GiST doesn't use the offset on inner pages (on leaf pages, the
ItemPointerData points to a heap tuple); there it is always equal to
FirstOffsetNumber.  So I can use the offset to indicate the correctness of
the tuple's key...
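
The offset trick can be sketched as a toy model in C (the struct and constant names below are illustrative; the real ItemPointerData layout differs): since the offset field of an inner-page downlink always holds the same value, any other value can serve as a "this union key may be stale, descend and check" flag.

```c
/* Toy sketch of reusing the unused offset field of an inner-page item
 * pointer as a validity flag.  Illustrative names, not PostgreSQL's. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FIRST_OFFSET   1        /* the only value normally seen on inner pages */
#define INVALID_OFFSET 0xfffe   /* chosen marker: "union key not trustworthy" */

typedef struct {
    uint32_t blkno;             /* child block this downlink points to */
    uint16_t offset;            /* unused on inner pages -> reusable as flag */
} ItemPtr;

void mark_key_invalid(ItemPtr *ip)   { ip->offset = INVALID_OFFSET; }
void mark_key_valid(ItemPtr *ip)     { ip->offset = FIRST_OFFSET; }
bool key_is_valid(const ItemPtr *ip) { return ip->offset == FIRST_OFFSET; }
```

Crash recovery would mark the affected downlinks invalid without calling any opclass function; later searches treat an invalid key as "always matches", and inserts or vacuum restore the real union and mark the key valid again.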



-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: Gist Recovery testing

From
Tom Lane
Date:
Teodor Sigaev <teodor@sigaev.ru> writes:
>> Rather than eating up the extra flag bit, though, would it be possible
>> to change the tuple to appear to contain a NULL?

> We would like to preserve NULL as a non-special value because we hope to
> add support for NULLs to GiST, although it's of low priority.

Fair enough.

> On the other hand, IndexTupleData contains an ItemPointerData, which
> holds a block number and an offset.  GiST doesn't use the offset on
> inner pages (on leaf pages, the ItemPointerData points to a heap tuple);
> there it is always equal to FirstOffsetNumber.  So I can use the offset
> to indicate the correctness of the tuple's key...

OK, that's a tad klugy but I agree with the approach.  I'd like to
preserve the spare flag bit in IndexTuple->t_info --- we might need
that someday.
        regards, tom lane


Re: Gist Recovery testing

From
Stefan Kaltenbrunner
Date:
Teodor Sigaev wrote:
>> btree manages to avoid calling the index support functions during WAL
>> restore --- can't you do the same?  Perhaps you need to be including
>> more information in the initial xlog records, so that the cleanup step
>> has everything it needs.  Or resort to brute-force search (which is more
>> or less what btree does).  I don't think this operation needs to be very
>> efficient, since it's a corner case that should only seldom be invoked.
> 
> 
> I've just committed WAL logging for GiST.  It works for online backup
> (normal recovery) and mostly after a crash, but in that case it can't
> restore incomplete inserts, although it can detect them and write to the
> log that the GiST index needs to be reindexed.


FYI: we now have at least 4 machines (otter, kingfisher, lionfish, corgi) on
the buildfarm crashing during testing of GiST-related things in contrib.

Any chance this could be related to this change?


Stefan


Re: Gist Recovery testing

From
Oleg Bartunov
Date:
On Wed, 15 Jun 2005, Stefan Kaltenbrunner wrote:

> Teodor Sigaev wrote:
>>> btree manages to avoid calling the index support functions during WAL
>>> restore --- can't you do the same?  Perhaps you need to be including
>>> more information in the initial xlog records, so that the cleanup step
>>> has everything it needs.  Or resort to brute-force search (which is more
>>> or less what btree does).  I don't think this operation needs to be very
>>> efficient, since it's a corner case that should only seldom be invoked.
>>
>>
>> I've just committed WAL logging for GiST.  It works for online backup
>> (normal recovery) and mostly after a crash, but in that case it can't
>> restore incomplete inserts, although it can detect them and write to the
>> log that the GiST index needs to be reindexed.
>
>
> FYI: we now have at least 4 machines(otter,kingfisher,lionfish,corgi) on
> the buildfarm crashing during testing of GIST-related things in contrib.
>
> Any chance this could be related to this change ?

Most probably :) But wait a little bit.
We have a patch currently being tested, and I see no problems with any of the
GiST-based contribs on my Slackware Linux 10.1 using it.

>
>
> Stefan
>
>
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


Re: Gist Recovery testing

From
Stefan Kaltenbrunner
Date:
Oleg Bartunov wrote:
> On Wed, 15 Jun 2005, Stefan Kaltenbrunner wrote:
> 
>> Teodor Sigaev wrote:
>>
>>>> btree manages to avoid calling the index support functions during WAL
>>>> restore --- can't you do the same?  Perhaps you need to be including
>>>> more information in the initial xlog records, so that the cleanup step
>>>> has everything it needs.  Or resort to brute-force search (which is
>>>> more
>>>> or less what btree does).  I don't think this operation needs to be
>>>> very
>>>> efficient, since it's a corner case that should only seldom be invoked.
>>>
>>>
>>>
>>> I've just committed WAL logging for GiST.  It works for online backup
>>> (normal recovery) and mostly after a crash, but in that case it can't
>>> restore incomplete inserts, although it can detect them and write to the
>>> log that the GiST index needs to be reindexed.
>>
>>
>>
>> FYI: we now have at least 4 machines(otter,kingfisher,lionfish,corgi) on
>> the buildfarm crashing during testing of GIST-related things in contrib.
>>
>> Any chance this could be related to this change ?
> 
> 
> Most probably :) But, wait a little bit. We have a patch currently
> tested and I see no problem with all GiST-based contribs on my Slackware
> Linux 10.1 using it.


I played a little bit on lionfish (this is the result of a COPY of the
btree_gist test data into a variant of the regression-test tables) and
managed to get the following backtrace:


#0  gistmakedeal (state=0x0, giststate=0x7fff5128) at gist.c:597
#1  0x00436658 in gistdoinsert (r=0x2c0752e0, itup=0x100b4c10,
giststate=0x7fff5128) at gist.c:325
#2  0x00436444 in gistinsert (fcinfo=0x2b52eab0) at gist.c:288
#3  0x0073522c in FunctionCall6 (flinfo=0x2b52eab0, arg1=39, arg2=0,
arg3=810, arg4=5, arg5=39, arg6=5) at fmgr.c:1270
#4  0x0045aca8 in index_insert (indexRelation=0x2c0752e0,
values=0x7fff6920, isnull=0x7fff69a0 "", heap_t_ctid=0x100b2bec,
heapRelation=0x1, check_uniqueness=0 '\0')   at indexam.c:215
#5  0x00580074 in ExecInsertIndexTuples (slot=0x100ad710,
tupleid=0x100b2bec, estate=0x100ab5c0, is_vacuum=0 '\0') at execUtils.c:935
#6  0x00519800 in CopyFrom (rel=0x2c072a90, attnumlist=0x100ab200,
binary=0 '\0', oids=0 '\0', delim=0x7c277c "\t", null_print=0x7c2774
"\\N", csv_mode=0 '\0',   quote=0x0, escape=0x0, force_notnull_atts=0x0, header_line=0 '\0')
at copy.c:1955
#7  0x00515f08 in DoCopy (stmt=0x2b52eab0) at copy.c:1032
#8  0x00671868 in ProcessUtility (parsetree=0x10090f10, params=0x0,
dest=0x10090f78, completionTag=0x7fff6f78 "") at utility.c:608
#9  0x0066f5ec in PortalRunUtility (portal=0x100990c8, query=0x10090fc8,
dest=0x10090f78, completionTag=0x7fff6f78 "") at pquery.c:940
#10 0x0066fbb0 in PortalRunMulti (portal=0x100990c8, dest=0x10090f78,
altdest=0x10090f78, completionTag=0x7fff6f78 "") at pquery.c:1007
#11 0x0066eb30 in PortalRun (portal=0x100990c8, count=2147483647,
dest=0x10090f78, altdest=0x10090f78, completionTag=0x7fff6f78 "") at
pquery.c:617
#12 0x00666f60 in exec_simple_query (query_string=0x10090be8 "COPY
inettmp FROM STDIN") at postgres.c:1021
#13 0x0066be54 in PostgresMain (argc=4, argv=0x100513c0,
username=0x10051390 "pgbuild") at postgres.c:3186
#14 0x00621304 in BackendRun (port=0x10060300) at postmaster.c:2800
#15 0x006208bc in BackendStartup (port=0x10060300) at postmaster.c:2440
#16 0x0061d23c in ServerLoop () at postmaster.c:1221
#17 0x0061b6d0 in PostmasterMain (argc=3, argv=0x10050e40) at
postmaster.c:930
#18 0x005ab904 in main (argc=3, argv=0x10050e40) at main.c:268



Stefan


Re: Gist Recovery testing

From
Tom Lane
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> FYI: we now have at least 4 machines(otter,kingfisher,lionfish,corgi) on
> the buildfarm crashing during testing of GIST-related things in contrib.

I'm seeing some problems on Mac OS X, too.  The tsearch regression test
crashed ... which we may not care about much since tsearch is presumably
going away ... but after that, the postmaster failed to restart:

LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 0/4980748
PANIC:  gistRedoEntryUpdateRecord: uninitialized page
LOG:  startup process (PID 1728) was terminated by signal 6
LOG:  aborting startup due to startup process failure

Stack trace for that is

#0  0x9004a12c in kill ()
#1  0x90120954 in abort ()
#2  0x001cfcdc in errfinish (dummy=0) at elog.c:451
#3  0x001d06bc in elog_finish (elevel=-1073761600, fmt=0x2818dc "\002") at elog.c:932
#4  0x0000c7c4 in gistRedoEntryUpdateRecord (lsn={xlogid = 0, xrecoff = 77628316}, record=0x57, isnewroot=0 '\0') at
gistxlog.c:186
#5  0x0000ca0c in gist_redo (lsn={xlogid = 0, xrecoff = 77628316}, record=0xa8d000) at gistxlog.c:399
#6  0x00042b4c in StartupXLOG () at xlog.c:4509
#7  0x0004c9d8 in BootstrapMain (argc=4, argv=0xbfffde28) at bootstrap.c:413
#8  0x00126278 in StartChildProcess (xlop=2) at postmaster.c:3484
#9  0x00126858 in reaper (postgres_signal_arg=0) at postmaster.c:2165
#10 <signal handler called>
#11 0x9001efe8 in select ()
#12 0x00126e70 in ServerLoop () at postmaster.c:1168
#13 0x00128c54 in PostmasterMain (argc=3, argv=0xd00600) at postmaster.c:930
#14 0x000e3050 in main (argc=3, argv=0xbfffe57c) at main.c:268

I checked that it was processing a type-0 (XLOG_GIST_ENTRY_UPDATE)
record, but am not sure what else to look at.  I do think it's
questionable that this log record type doesn't appear to reference
any buffers in the list it passes to XLogInsert.  I believe the
general rule is that an xlog record that is describing a change in
a buffered page ought to tell XLogInsert so --- and the redo routine
has to check to see if the change was replaced by a whole-page record.
I'm not entirely sure that violating that rule leads directly to this
class of failure, but it definitely leads to failure to recover from
partial page writes during a hardware crash.
        regards, tom lane


Re: Gist Recovery testing

From
Tom Lane
Date:
I wrote:
> I'm seeing some problems on Mac OS X, too.  The tsearch regression test
> crashed ... which we may not care about much since tsearch is presumably
> going away ... but after that, the postmaster failed to restart:

I tried it again to check that it was reproducible.  The tsearch test
crashed in apparently the same place, but the recovery failed in a
*different* place:

LOG:  redo starts at 0/470054
PANIC:  lock 207 is not held
LOG:  startup process (PID 2077) was terminated by signal 6
LOG:  aborting startup due to startup process failure

(gdb) bt
#0  0x9004a12c in kill ()
#1  0x90120954 in abort ()
#2  0x001cfcdc in errfinish (dummy=0) at elog.c:451
#3  0x001d06bc in elog_finish (elevel=-1073761552, fmt=0x2818dc "\002") at elog.c:932
#4  0x0014d718 in LWLockRelease (lockid=207) at lwlock.c:442
#5  0x0000cff8 in gist_redo (lsn={xlogid = 0, xrecoff = 5130896}, record=0x14) at gistxlog.c:315
#6  0x00042b4c in StartupXLOG () at xlog.c:4509
#7  0x0004c9d8 in BootstrapMain (argc=4, argv=0xbfffde28) at bootstrap.c:413
#8  0x00126278 in StartChildProcess (xlop=2) at postmaster.c:3484
#9  0x00126858 in reaper (postgres_signal_arg=0) at postmaster.c:2165
#10 <signal handler called>
#11 0x9001efe8 in select ()
#12 0x00126e70 in ServerLoop () at postmaster.c:1168
#13 0x00128c54 in PostmasterMain (argc=3, argv=0xd00600) at postmaster.c:930
#14 0x000e3050 in main (argc=3, argv=0xbfffe57c) at main.c:268

I find this code pretty unreadable, but possibly lines 315 and 316
should be manipulating newpage->buffer instead of buffer?
        regards, tom lane


Re: Gist Recovery testing

From
Teodor Sigaev
Date:
> I find this code pretty unreadable, but possibly lines 315 and 316
> should be manipulating newpage->buffer instead of buffer?
> 

You are right; I'll commit the changes a bit later, along with a lot of
additional work.

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/