Thread: pgsql: Improve gist XLOG code to follow the coding rules needed to
pgsql: Improve gist XLOG code to follow the coding rules needed to
From
tgl@postgresql.org (Tom Lane)
Date:
Log Message: ----------- Improve gist XLOG code to follow the coding rules needed to prevent torn-page problems. This introduces some issues of its own, mainly that there are now some critical sections of unreasonably broad scope, but it's a step forward anyway. Further cleanup will require some code refactoring that I'd prefer to get Oleg and Teodor involved in. Modified Files: -------------- pgsql/src/backend/access/gist: gist.c (r1.129 -> r1.130) (http://developer.postgresql.org/cvsweb.cgi/pgsql/src/backend/access/gist/gist.c.diff?r1=1.129&r2=1.130) gistvacuum.c (r1.16 -> r1.17) (http://developer.postgresql.org/cvsweb.cgi/pgsql/src/backend/access/gist/gistvacuum.c.diff?r1=1.16&r2=1.17) gistxlog.c (r1.12 -> r1.13) (http://developer.postgresql.org/cvsweb.cgi/pgsql/src/backend/access/gist/gistxlog.c.diff?r1=1.12&r2=1.13) pgsql/src/include/access: gist_private.h (r1.11 -> r1.12) (http://developer.postgresql.org/cvsweb.cgi/pgsql/src/include/access/gist_private.h.diff?r1=1.11&r2=1.12)
> ----------- > Improve gist XLOG code to follow the coding rules needed to prevent > torn-page problems. This introduces some issues of its own, mainly > that there are now some critical sections of unreasonably broad scope, > but it's a step forward anyway. Further cleanup will require some > code refactoring that I'd prefer to get Oleg and Teodor involved in. Here I am. Oleg now is in expedition to solar eclipse (should return soon). Some answers on your XXX * gist.c:gistnewroot GISTInitBuffer(buffer, 0); /* XXX not F_LEAF? */ F_LEAF will be never set in new root, because gistnewroot is called when split of old root page is occured. Even it was a leaf (tree had only one page), new root will be non-leaf page. * gistxlog.c:gistContinueInsert /* * XXX fall out to avoid making LOG message at bottom of routine. * I think the logic for when to emit that message is all wrong... */ At line 543 itup vector was filled by invalid tuple, so newly filled root will contains only invalid tuples. Hence, reindex/vacuum full is required. Sorry, I missed something, what is torn-page problem? Looking in your changes, I decided, that now it needs to around by CRIT_SECTION any buffer/page changes? -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
Teodor Sigaev <teodor@sigaev.ru> writes: > Here I am. Oleg now is in expedition to solar eclipse (should return soon). Cool ... I hope to see one someday ... > * gistxlog.c:gistContinueInsert > /* > * XXX fall out to avoid making LOG message at bottom of routine. > * I think the logic for when to emit that message is all wrong... > */ > At line 543 itup vector was filled by invalid tuple, so newly filled root > will contains only invalid tuples. Hence, reindex/vacuum full is required. Hmm. That probably needs to be redesigned then. The problem is that as the thing is written, you *must* reinit the buffer here --- the code I removed that checked the page LSN was unsafe. If you want to depend on the prior page contents during replay, you have to give XLogInsert the opportunity to save the whole page when the xlog entry is made. > Sorry, I missed something, what is torn-page problem? The torn-page problem is where a page only gets partially written to disk during a power failure. When we come back up, the LSN might look like the page is up to date, when in reality the data is old/inconsistent. We deal with this by having the first WAL record that touches a given page after a checkpoint contain a copy of the whole page, and then we can rewrite the page from that copy instead of trying to replay the incremental update. I've fixed the code to do this correctly for the gistRedoPageUpdateRecord case, but I'm not sure what we do for gistContinueInsert. BTW, there's another problem with gistContinueInsert: it's not making WAL entries for the actions it takes. It needs to do so --- consider for example the PITR case, where the log will be shipped somewhere else and then followed verbatim. So you've got to add WAL entries for any recovery actions you take after reading the existing WAL entries. The critical-section problem is not related to that; it's about being sure that we don't make any changes to shared buffers that don't have an associated WAL record. (See "WAL dirty-buffer management bug" thread on -hackers.) I fixed the gist code so that it holds a critical section throughout doing an insertion or vacuum operation, but that's really far too wide --- there's too much chance of an elog() somewhere in the process. In particular, we really don't want to call any user-defined datatype functions inside the critical section. So the process needs to be to compute all the required changes *but not make them*, then start the critical section, then apply the changes and make the WAL record. This seemed like a large enough change that I figured I'd better talk with you about it. One idea I had was to generate the WAL record as the output of the decision-making code, and then the critical section would use the WAL record as its guide to what to do to the buffers. BTW, I'm confused about gistadjscans: seems to me that's either broken or unnecessary. Since we no longer hold an exclusive lock on a gist index while inserting into it, the comment at gistscan.c line 33 is certainly wrong now. If the adjustment is still necessary then some other mechanism needs to be devised to get the information to other backends. If it's not necessary then I think we should take it out. I'm not totally familiar with the gist logic, but it looks to me like gistadjscans is only called during an insert into a non-leaf page, which makes me think it might be unnecessary --- do gist index scans ever stop on non-leaf pages? regards, tom lane
>> At line 543 itup vector was filled by invalid tuple, so newly filled root >> will contains only invalid tuples. Hence, reindex/vacuum full is required. > > Hmm. That probably needs to be redesigned then. The problem is that as > the thing is written, you *must* reinit the buffer here --- the code I > removed that checked the page LSN was unsafe. If you want to depend on > the prior page contents during replay, you have to give XLogInsert the > opportunity to save the whole page when the xlog entry is made. Don't understand. Root will be fully regenerated and filled with invalid tuples. > old/inconsistent. We deal with this by having the first WAL record that > touches a given page after a checkpoint contain a copy of the whole > page, and then we can rewrite the page from that copy instead of trying I see, there is a problem with gistSplit: it can generate more than 3 pages (three - is a limit of XLogInsert) in a case of bad user's picksplit method... But it is a very "corner" case... In practice, I don't see more than three pages, three pages are rarely generated by gist__int_ops (contrib/intarray). > BTW, there's another problem with gistContinueInsert: it's not making > WAL entries for the actions it takes. It needs to do so --- consider > for example the PITR case, where the log will be shipped somewhere else > and then followed verbatim. So you've got to add WAL entries for any > recovery actions you take after reading the existing WAL entries. Ugh, I see > process. In particular, we really don't want to call any user-defined > datatype functions inside the critical section. Agree > be to compute all the required changes *but not make them*, then start > the critical section, then apply the changes and make the WAL record. > This seemed like a large enough change that I figured I'd better talk > with you about it. One idea I had was to generate the WAL record as the > output of the decision-making code, and then the critical section would > use the WAL record as its guide to what to do to the buffers. Hmm, we will think. > BTW, I'm confused about gistadjscans: seems to me that's either broken > or unnecessary. Since we no longer hold an exclusive lock on a gist > index while inserting into it, the comment at gistscan.c line 33 is > certainly wrong now. If the adjustment is still necessary then some > other mechanism needs to be devised to get the information to other > backends. If it's not necessary then I think we should take it out. > I'm not totally familiar with the gist logic, but it looks to me like > gistadjscans is only called during an insert into a non-leaf page, > which makes me think it might be unnecessary --- do gist index scans > ever stop on non-leaf pages? It seems to me - you are right. I missed that. I'll remove it and test changes. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
>> Hmm. That probably needs to be redesigned then. > Don't understand. Root will be fully regenerated and filled with invalid tuples. Well, ISTM that if complete replay of the WAL sequence yields an invalid index, then you've not got an adequate design for the WAL data. Special recovery actions really should only be needed if the WAL log is incomplete (ie, it ends in the middle of a page-split sequence). > I see, there is a problem with gistSplit: it can generate more than 3 pages > (three - is a limit of XLogInsert) in a case of bad user's picksplit method... I think we are OK on that, actually, because the page-split WAL record is designed to rewrite all of the pages completely. Only pages that are going to be incrementally updated need to be exposed to XLogInsert. So the root-page case in page-update is the only one that seems inadequate. regards, tom lane
Tom Lane wrote: >>> Hmm. That probably needs to be redesigned then. > >> Don't understand. Root will be fully regenerated and filled with invalid tuples. > > Well, ISTM that if complete replay of the WAL sequence yields an invalid > index, then you've not got an adequate design for the WAL data. Special > recovery actions really should only be needed if the WAL log is > incomplete (ie, it ends in the middle of a page-split sequence). Hm. The mentioned piece of code executes only for incomplete inserts. It is a part of gistContinueInsert() called from gist_xlog_cleanup()... >> I see, there is a problem with gistSplit: it can generate more than 3 pages >> (three - is a limit of XLogInsert) in a case of bad user's picksplit method... > > I think we are OK on that, actually, because the page-split WAL record > is designed to rewrite all of the pages completely. Only pages that are > going to be incrementally updated need to be exposed to XLogInsert. > So the root-page case in page-update is the only one that seems inadequate. I was reading nbtree code and noticed, that new (just created) page fills "in place", without START_CRIT_SECTION. As I understand, new page we can easy change while it hasn't any link from other tree/pages. gistSplit fills shadow (returned by PageGetTempPage()) of original page, so, in that case, splitSplit can postpone PageRestoreTempPage() to caller. I can make changes of gistSplit() on this week. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/