Thread: GiST recovery testing
Ok, I have already implemented WAL in GiST and tested it with online backup. It works. Now I need to test recovery after a crash (power loss etc.), and there is a problem in GiST with incomplete inserts. I hope I have resolved it, but this problem, I think, has the same nature as incomplete splits in the btree code. To solve that, the btree xlog code keeps a stack of incomplete splits, and if any are left incomplete, the btree_xlog_cleanup function completes them. I use a similar idea, but I can't reproduce the situation where xlog_cleanup does any useful work instead of just looking at an empty stack.

So the question: how can the desired behaviour be reproduced? I tried kill(getpid(), SIGKILL) in the middle of an insertion (inside gist.c), but postgres doesn't try to restore the aborted transaction...

Thank you.

-- 
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Teodor Sigaev <teodor@sigaev.ru> writes:
> Ok, I already realized WAL in GiST and tested with online backup. It works.
> Now I need test it on recovery after crash (power loss etc) and there is a
> problem in GiST with incompleted inserts, I hope, I resolved it, but this
> problem, I think, has the same nature as incompleted splits in btree code.
> For solving, btree xlog code uses stack of incompleted splits and if they
> leaves incompleted, btree_xlog_cleanup function completes them. I make use
> similar idea, but I can't reproduce situation when xlog_cleanup makes some
> useful work except just looks at void stack. So question: how is it
> possible to reproduce desired behaviour?

It's not very easy; you have to arrange for the WAL history to end
between the two action records that you are interested in. What I'd
suggest is:

- run the test backend under GDB, set a breakpoint between making the
  log entry for the first split step and the entry for the second
- let it reach the breakpoint
- in another backend, commit some unrelated transaction to force the
  XLOG buffers to disk
- in GDB, "kill -9" to prevent the test backend from proceeding

If you want to trace through the recovery as well, the best bet might be
to temporarily add a "sleep(30)" near the beginning of StartupXLOG so
that you have time to reattach to the recovery process.

regards, tom lane
There is a problem with the gist_xlog_cleanup function: we have no
information about user-defined functions and indexes, because the
postmaster hasn't started yet and the system catalog isn't available.
For btree that's not necessary, but for GiST we certainly need them. Is
it possible to add an additional hook after the cleanup stage, so we
could use it to complete incomplete inserts? Any suggestions?

Oleg

On Fri, 10 Jun 2005, Tom Lane wrote:
> Teodor Sigaev <teodor@sigaev.ru> writes:
>> Ok, I already realized WAL in GiST and tested with online backup. It works.
>> [...] So question: how is it possible to reproduce desired behaviour?
>
> It's not very easy; you have to arrange for the WAL history to end
> between the two action records that you are interested in. What I'd
> suggest is
> - run test backend under GDB, set breakpoint between
>   making log entry for first split step and entry for second.
> - let it reach the breakpoint
> - in another backend, commit some unrelated transaction to
>   force XLOG buffers to disk
> - in GDB, "kill -9" to prevent test backend from proceeding
>
> If you want to trace through the recovery as well, the best bet might be
> to temporarily add a "sleep(30)" near the beginning of StartupXlog so
> that you have time to reattach to the recovery process.
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 8: explain analyze is your friend

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
Oleg Bartunov <oleg@sai.msu.su> writes:
> There is a problem with gist_xlog_cleanup function - we have no
> information about user defined functions and indices, because postmaster
> doesn't started yet and system catalog isn't available. For btree it's not
> necessary, but for GiST we certainly need them.

btree manages to avoid calling the index support functions during WAL
restore --- can't you do the same? Perhaps you need to be including
more information in the initial xlog records, so that the cleanup step
has everything it needs. Or resort to brute-force search (which is more
or less what btree does). I don't think this operation needs to be very
efficient, since it's a corner case that should only seldom be invoked.

It's a really bad idea to be calling any support functions during WAL
recovery, since you won't have a transaction open for them to run in...

regards, tom lane
> btree manages to avoid calling the index support functions during WAL
> restore --- can't you do the same? Perhaps you need to be including
> more information in the initial xlog records, so that the cleanup step
> has everything it needs. Or resort to brute-force search (which is more
> or less what btree does). I don't think this operation needs to be very
> efficient, since it's a corner case that should only seldom be invoked.

I've just committed WAL logging for GiST. It works for online backup
(normal recovery) and mostly after a crash, but in that case it can't
restore incomplete inserts, although it can detect them and say in the
log that the GiST index needs to be reindexed.

The problem with incomplete inserts is: when a new entry is installed
into a leaf page, the whole chain (in the worst case) of keys from the
root page down to the leaf should be updated. Of course, updating the
whole chain is rare, and usually only part of it is updated. Each key
on an inner page contains a union of the keys (a bounding box in the
case of rtree, for example) on the page it points to. This union can be
formed only with the help of a user-defined function of the opclass,
because GiST doesn't know anything about the nature of the keys.

Returning to WAL: the GiST core writes an xlog entry with all the
information necessary for restoration before writing the page, but at
that moment it doesn't know whether it will need to update keys on the
parent page or the key is unchanged. So GiST's WAL restoration code
should remember this page update as an incomplete insert. When the
insert is completed, the GiST core writes to the log that the insert is
completed, and the restore code can clean up the stored incomplete
insert. If there was a crash, the sign of a completed insert may be
absent from the log, and GiST's restore code should continue the
insert. While continuing, it knows which page was changed and should
walk up to the root. On each step of the walk it should form the union
for the page and insert it into the parent.
This is absent in btree because btree is sorted, so the parent's key can
simply be taken from the child page and the xlog code doesn't need to
call any support function. The ways I see:

1. Leave it as is. Incomplete inserts are not a frequent case, and the
   code can detect them and suggest a reindex.

2. In principle, it's possible to collect all the information about the
   changes per insert into the tree in one WAL record. But that isn't a
   good solution, because it requires ugly coding of inserts: collect
   all the pages and write them out after pushing the WAL record.
   Second, this manner kills the use of the LSN for page-level
   concurrency control. But it fully solves the incomplete-insert
   problem.

3. Construct the relation in XLogOpenRelation somewhat more fully than
   now. In fact, the restore code needs only two things about the
   index: index->rd_att and the information needed for
   index_getprocinfo to work. I understand why the idea of calling any
   support functions during WAL recovery is bad.

4. Add a hook to finalize the WAL code. This hook would be called once
   the system catalog is made accessible.

5. There is one unused bit in IndexTupleData. The GiST code can use it
   as a mark that the tuple contains an incorrect union key. In this
   case, the GiST search algorithm should keep in mind that such a
   tuple has an incorrect value and should always descend to its child
   to check. During insertions into the tree, such tuples will little
   by little be changed to correct ones. A defect of this scheme is
   possible performance degradation after crash recovery. On the other
   hand, it's possible to add recreation of such tuples to the vacuum
   code (in gistbulkdelete).

-- 
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Teodor Sigaev <teodor@sigaev.ru> writes:
> ... Which ways I see:
> 5 There is one unused bit in IndexTupleData. GiST code can use it as
> mark that this tuple contains incorrect union key. In this case, GiST
> search algorithm should keep it mind that such tuple has incorrect
> value and always should go to it's child to check. During insertions
> in tree such tuples will be little by little changed to
> correct. Defect of this scheme will be a possible performance
> degradation after crash recovery. In other hand, it's possible to add
> recreation such tuples to vacuum code (in gistbulkdelete).

This seems like a good answer --- if you fix up such things during
vacuum then the performance issue won't last too long.

Rather than eating up the extra flag bit, though, would it be possible
to change the tuple to appear to contain a NULL? AFAIK, GiST can't have
a NULL (at least not in the first column) so this should be an unused
state.

regards, tom lane
> This seems like a good answer --- if you fix up such things during
> vacuum then the performance issue won't last too long.
>
> Rather than eating up the extra flag bit, though, would it be possible
> to change the tuple to appear to contain a NULL? AFAIK, GiST can't have
> a NULL (at least not in the first column) so this should be an unused
> state.

We would like to preserve NULL as a non-special value because we hope to
add support for NULLs to GiST, although it's of low priority. On the
other hand, IndexTupleData contains an ItemPointerData, which contains a
block number and an offset. GiST doesn't use the offset on inner pages
(on leaf pages, the ItemPointerData points to a heap tuple); it is
always equal to FirstOffsetNumber. So I can use the offset to indicate
the correctness of the tuple's key...

-- 
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Teodor Sigaev <teodor@sigaev.ru> writes:
>> Rather than eating up the extra flag bit, though, would it be possible
>> to change the tuple to appear to contain a NULL?

> We would like to preserve NULL as non-special value because we hope to add
> support of NULLs to GiST, although it's of low priority.

Fair enough.

> In other side, IndexTupleData contains ItemPointerData which contains
> block number and offset. GiST doesn't use offset on inner pages (on
> leaf pages, ItemPointerData points to heap tuple), it always equal
> FirstOffsetNumber. So I can use offset to indicate correctness of
> tuple's key...

OK, that's a tad kludgy, but I agree with the approach. I'd like to
preserve the spare flag bit in IndexTuple->t_info --- we might need that
someday.

regards, tom lane
Teodor Sigaev wrote:
>> btree manages to avoid calling the index support functions during WAL
>> restore --- can't you do the same? Perhaps you need to be including
>> more information in the initial xlog records, so that the cleanup step
>> has everything it needs. Or resort to brute-force search (which is more
>> or less what btree does). I don't think this operation needs to be very
>> efficient, since it's a corner case that should only seldom be invoked.
>
> I've just commited WALogging for GiST. It works for online backup
> (normal recovery) and mostly after crash, but in this case it can't
> restore inclompleted inserts although it can detected and say to log
> thet it's needed to reindex GiST index.

FYI: we now have at least 4 machines (otter, kingfisher, lionfish,
corgi) on the buildfarm crashing during testing of GiST-related things
in contrib.

Any chance this could be related to this change?

Stefan
On Wed, 15 Jun 2005, Stefan Kaltenbrunner wrote:
> Teodor Sigaev wrote:
>>> btree manages to avoid calling the index support functions during WAL
>>> restore --- can't you do the same? [...]
>>
>> I've just commited WALogging for GiST. It works for online backup
>> (normal recovery) and mostly after crash, but in this case it can't
>> restore inclompleted inserts although it can detected and say to log
>> thet it's needed to reindex GiST index.
>
> FYI: we now have at least 4 machines (otter, kingfisher, lionfish,
> corgi) on the buildfarm crashing during testing of GiST-related things
> in contrib.
>
> Any chance this could be related to this change?

Most probably :) But wait a little bit. We have a patch currently being
tested, and I see no problem with any of the GiST-based contribs on my
Slackware Linux 10.1 using it.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
Oleg Bartunov wrote:
> On Wed, 15 Jun 2005, Stefan Kaltenbrunner wrote:
>> Teodor Sigaev wrote:
>>>> btree manages to avoid calling the index support functions during WAL
>>>> restore --- can't you do the same? [...]
>>>
>>> I've just commited WALogging for GiST. It works for online backup
>>> (normal recovery) and mostly after crash, but in this case it can't
>>> restore inclompleted inserts although it can detected and say to log
>>> thet it's needed to reindex GiST index.
>>
>> FYI: we now have at least 4 machines (otter, kingfisher, lionfish,
>> corgi) on the buildfarm crashing during testing of GiST-related things
>> in contrib.
>>
>> Any chance this could be related to this change?
>
> Most probably :) But, wait a little bit. We have a patch currently
> tested and I see no problem with all GiST-based contribs on my Slackware
> Linux 10.1 using it.
I played a little bit on lionfish (this is the result of a COPY of the
btree_gist test data into a variant of the regression-test tables) and
managed to get the following backtrace:

#0  gistmakedeal (state=0x0, giststate=0x7fff5128) at gist.c:597
#1  0x00436658 in gistdoinsert (r=0x2c0752e0, itup=0x100b4c10, giststate=0x7fff5128) at gist.c:325
#2  0x00436444 in gistinsert (fcinfo=0x2b52eab0) at gist.c:288
#3  0x0073522c in FunctionCall6 (flinfo=0x2b52eab0, arg1=39, arg2=0, arg3=810, arg4=5, arg5=39, arg6=5) at fmgr.c:1270
#4  0x0045aca8 in index_insert (indexRelation=0x2c0752e0, values=0x7fff6920, isnull=0x7fff69a0 "", heap_t_ctid=0x100b2bec, heapRelation=0x1, check_uniqueness=0 '\0') at indexam.c:215
#5  0x00580074 in ExecInsertIndexTuples (slot=0x100ad710, tupleid=0x100b2bec, estate=0x100ab5c0, is_vacuum=0 '\0') at execUtils.c:935
#6  0x00519800 in CopyFrom (rel=0x2c072a90, attnumlist=0x100ab200, binary=0 '\0', oids=0 '\0', delim=0x7c277c "\t", null_print=0x7c2774 "\\N", csv_mode=0 '\0', quote=0x0, escape=0x0, force_notnull_atts=0x0, header_line=0 '\0') at copy.c:1955
#7  0x00515f08 in DoCopy (stmt=0x2b52eab0) at copy.c:1032
#8  0x00671868 in ProcessUtility (parsetree=0x10090f10, params=0x0, dest=0x10090f78, completionTag=0x7fff6f78 "") at utility.c:608
#9  0x0066f5ec in PortalRunUtility (portal=0x100990c8, query=0x10090fc8, dest=0x10090f78, completionTag=0x7fff6f78 "") at pquery.c:940
#10 0x0066fbb0 in PortalRunMulti (portal=0x100990c8, dest=0x10090f78, altdest=0x10090f78, completionTag=0x7fff6f78 "") at pquery.c:1007
#11 0x0066eb30 in PortalRun (portal=0x100990c8, count=2147483647, dest=0x10090f78, altdest=0x10090f78, completionTag=0x7fff6f78 "") at pquery.c:617
#12 0x00666f60 in exec_simple_query (query_string=0x10090be8 "COPY inettmp FROM STDIN") at postgres.c:1021
#13 0x0066be54 in PostgresMain (argc=4, argv=0x100513c0, username=0x10051390 "pgbuild") at postgres.c:3186
#14 0x00621304 in BackendRun (port=0x10060300) at postmaster.c:2800
#15 0x006208bc in BackendStartup (port=0x10060300) at postmaster.c:2440
#16 0x0061d23c in ServerLoop () at postmaster.c:1221
#17 0x0061b6d0 in PostmasterMain (argc=3, argv=0x10050e40) at postmaster.c:930
#18 0x005ab904 in main (argc=3, argv=0x10050e40) at main.c:268

Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> FYI: we now have at least 4 machines (otter, kingfisher, lionfish,
> corgi) on the buildfarm crashing during testing of GIST-related things
> in contrib.

I'm seeing some problems on Mac OS X, too. The tsearch regression test
crashed ... which we may not care about much since tsearch is presumably
going away ... but after that, the postmaster failed to restart:

LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 0/4980748
PANIC:  gistRedoEntryUpdateRecord: uninitialized page
LOG:  startup process (PID 1728) was terminated by signal 6
LOG:  aborting startup due to startup process failure

The stack trace for that is:

#0  0x9004a12c in kill ()
#1  0x90120954 in abort ()
#2  0x001cfcdc in errfinish (dummy=0) at elog.c:451
#3  0x001d06bc in elog_finish (elevel=-1073761600, fmt=0x2818dc "\002") at elog.c:932
#4  0x0000c7c4 in gistRedoEntryUpdateRecord (lsn={xlogid = 0, xrecoff = 77628316}, record=0x57, isnewroot=0 '\0') at gistxlog.c:186
#5  0x0000ca0c in gist_redo (lsn={xlogid = 0, xrecoff = 77628316}, record=0xa8d000) at gistxlog.c:399
#6  0x00042b4c in StartupXLOG () at xlog.c:4509
#7  0x0004c9d8 in BootstrapMain (argc=4, argv=0xbfffde28) at bootstrap.c:413
#8  0x00126278 in StartChildProcess (xlop=2) at postmaster.c:3484
#9  0x00126858 in reaper (postgres_signal_arg=0) at postmaster.c:2165
#10 <signal handler called>
#11 0x9001efe8 in select ()
#12 0x00126e70 in ServerLoop () at postmaster.c:1168
#13 0x00128c54 in PostmasterMain (argc=3, argv=0xd00600) at postmaster.c:930
#14 0x000e3050 in main (argc=3, argv=0xbfffe57c) at main.c:268

I checked that it was processing a type-0 (XLOG_GIST_ENTRY_UPDATE)
record, but am not sure what else to look at. I do think it's
questionable that this log record type doesn't appear to reference any
buffers in the list it passes to XLogInsert.
I believe the general rule is that an xlog record that is describing a change in a buffered page ought to tell XLogInsert so --- and the redo routine has to check to see if the change was replaced by a whole-page record. I'm not entirely sure that violating that rule leads directly to this class of failure, but it definitely leads to failure to recover from partial page writes during a hardware crash. regards, tom lane
I wrote:
> I'm seeing some problems on Mac OS X, too. The tsearch regression test
> crashed ... which we may not care about much since tsearch is presumably
> going away ... but after that, the postmaster failed to restart:

I tried it again to check that it was reproducible. The tsearch test
crashed in apparently the same place, but the recovery failed in a
*different* place:

LOG:  redo starts at 0/470054
PANIC:  lock 207 is not held
LOG:  startup process (PID 2077) was terminated by signal 6
LOG:  aborting startup due to startup process failure

(gdb) bt
#0  0x9004a12c in kill ()
#1  0x90120954 in abort ()
#2  0x001cfcdc in errfinish (dummy=0) at elog.c:451
#3  0x001d06bc in elog_finish (elevel=-1073761552, fmt=0x2818dc "\002") at elog.c:932
#4  0x0014d718 in LWLockRelease (lockid=207) at lwlock.c:442
#5  0x0000cff8 in gist_redo (lsn={xlogid = 0, xrecoff = 5130896}, record=0x14) at gistxlog.c:315
#6  0x00042b4c in StartupXLOG () at xlog.c:4509
#7  0x0004c9d8 in BootstrapMain (argc=4, argv=0xbfffde28) at bootstrap.c:413
#8  0x00126278 in StartChildProcess (xlop=2) at postmaster.c:3484
#9  0x00126858 in reaper (postgres_signal_arg=0) at postmaster.c:2165
#10 <signal handler called>
#11 0x9001efe8 in select ()
#12 0x00126e70 in ServerLoop () at postmaster.c:1168
#13 0x00128c54 in PostmasterMain (argc=3, argv=0xd00600) at postmaster.c:930
#14 0x000e3050 in main (argc=3, argv=0xbfffe57c) at main.c:268

I find this code pretty unreadable, but possibly lines 315 and 316
should be manipulating newpage->buffer instead of buffer?

regards, tom lane
> I find this code pretty unreadable, but possibly lines 315 and 316
> should be manipulating newpage->buffer instead of buffer?

You are right. I'll commit the changes a bit later, together with a lot
of additional work...

-- 
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/