Thread: pgsql: Improve gist XLOG code to follow the coding rules needed to

pgsql: Improve gist XLOG code to follow the coding rules needed to

From

tgl@postgresql.org (Tom Lane)

Date:

30 March 2006, 22:03:14

Log Message:
-----------
Improve gist XLOG code to follow the coding rules needed to prevent
torn-page problems.  This introduces some issues of its own, mainly
that there are now some critical sections of unreasonably broad scope,
but it's a step forward anyway.  Further cleanup will require some
code refactoring that I'd prefer to get Oleg and Teodor involved in.

Modified Files:
--------------
    pgsql/src/backend/access/gist:
        gist.c (r1.129 -> r1.130)
        (http://developer.postgresql.org/cvsweb.cgi/pgsql/src/backend/access/gist/gist.c.diff?r1=1.129&r2=1.130)
        gistvacuum.c (r1.16 -> r1.17)
        (http://developer.postgresql.org/cvsweb.cgi/pgsql/src/backend/access/gist/gistvacuum.c.diff?r1=1.16&r2=1.17)
        gistxlog.c (r1.12 -> r1.13)
        (http://developer.postgresql.org/cvsweb.cgi/pgsql/src/backend/access/gist/gistxlog.c.diff?r1=1.12&r2=1.13)
    pgsql/src/include/access:
        gist_private.h (r1.11 -> r1.12)
        (http://developer.postgresql.org/cvsweb.cgi/pgsql/src/include/access/gist_private.h.diff?r1=1.11&r2=1.12)

Re: pgsql: Improve gist XLOG code to follow the coding

From

Teodor Sigaev

Date:

31 March 2006, 08:35:34

> -----------
> Improve gist XLOG code to follow the coding rules needed to prevent
> torn-page problems.  This introduces some issues of its own, mainly
> that there are now some critical sections of unreasonably broad scope,
> but it's a step forward anyway.  Further cleanup will require some
> code refactoring that I'd prefer to get Oleg and Teodor involved in.

Here I am. Oleg now is in expedition to solar eclipse (should return soon).

Some answers on your XXX
* gist.c:gistnewroot
   GISTInitBuffer(buffer, 0);  /* XXX not F_LEAF? */
   F_LEAF will be never set in new root, because gistnewroot is called when
   split of old root page is occured. Even it was a leaf (tree had only one
   page), new root will be non-leaf page.

* gistxlog.c:gistContinueInsert
      /*
          * XXX fall out to avoid making LOG message at bottom of routine.
          * I think the logic for when to emit that message is all wrong...
          */
   At line 543 itup vector was filled by invalid tuple, so newly filled root
   will contains only invalid tuples. Hence, reindex/vacuum full is required.


Sorry, I missed something, what is torn-page problem?  Looking in your changes,
I decided, that now it needs to around by CRIT_SECTION any buffer/page changes?




--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: pgsql: Improve gist XLOG code to follow the coding

From

Tom Lane

Date:

31 March 2006, 14:35:29

Teodor Sigaev <teodor@sigaev.ru> writes:
> Here I am. Oleg now is in expedition to solar eclipse (should return soon).

Cool ... I hope to see one someday ...

> * gistxlog.c:gistContinueInsert
>       /*
>           * XXX fall out to avoid making LOG message at bottom of routine.
>           * I think the logic for when to emit that message is all wrong...
>           */
>    At line 543 itup vector was filled by invalid tuple, so newly filled root
>    will contains only invalid tuples. Hence, reindex/vacuum full is required.

Hmm.  That probably needs to be redesigned then.  The problem is that as
the thing is written, you *must* reinit the buffer here --- the code I
removed that checked the page LSN was unsafe.  If you want to depend on
the prior page contents during replay, you have to give XLogInsert the
opportunity to save the whole page when the xlog entry is made.

> Sorry, I missed something, what is torn-page problem?

The torn-page problem is where a page only gets partially written to
disk during a power failure.  When we come back up, the LSN might look
like the page is up to date, when in reality the data is
old/inconsistent.  We deal with this by having the first WAL record that
touches a given page after a checkpoint contain a copy of the whole
page, and then we can rewrite the page from that copy instead of trying
to replay the incremental update.  I've fixed the code to do this
correctly for the gistRedoPageUpdateRecord case, but I'm not sure what
we do for gistContinueInsert.

BTW, there's another problem with gistContinueInsert: it's not making
WAL entries for the actions it takes.  It needs to do so --- consider
for example the PITR case, where the log will be shipped somewhere else
and then followed verbatim.  So you've got to add WAL entries for any
recovery actions you take after reading the existing WAL entries.

The critical-section problem is not related to that; it's about being
sure that we don't make any changes to shared buffers that don't have
an associated WAL record.  (See "WAL dirty-buffer management bug" thread
on -hackers.)  I fixed the gist code so that it holds a critical section
throughout doing an insertion or vacuum operation, but that's really far
too wide --- there's too much chance of an elog() somewhere in the
process.  In particular, we really don't want to call any user-defined
datatype functions inside the critical section.  So the process needs to
be to compute all the required changes *but not make them*, then start
the critical section, then apply the changes and make the WAL record.
This seemed like a large enough change that I figured I'd better talk
with you about it.  One idea I had was to generate the WAL record as the
output of the decision-making code, and then the critical section would
use the WAL record as its guide to what to do to the buffers.

BTW, I'm confused about gistadjscans: seems to me that's either broken
or unnecessary.  Since we no longer hold an exclusive lock on a gist
index while inserting into it, the comment at gistscan.c line 33 is
certainly wrong now.  If the adjustment is still necessary then some
other mechanism needs to be devised to get the information to other
backends.  If it's not necessary then I think we should take it out.
I'm not totally familiar with the gist logic, but it looks to me like
gistadjscans is only called during an insert into a non-leaf page,
which makes me think it might be unnecessary --- do gist index scans
ever stop on non-leaf pages?

            regards, tom lane

Re: pgsql: Improve gist XLOG code to follow the coding

From

Teodor Sigaev

Date:

03 April 2006, 08:40:36

>>    At line 543 itup vector was filled by invalid tuple, so newly filled root
>>    will contains only invalid tuples. Hence, reindex/vacuum full is required.
>
> Hmm.  That probably needs to be redesigned then.  The problem is that as
> the thing is written, you *must* reinit the buffer here --- the code I
> removed that checked the page LSN was unsafe.  If you want to depend on
> the prior page contents during replay, you have to give XLogInsert the
> opportunity to save the whole page when the xlog entry is made.

Don't understand. Root will be fully regenerated and filled with invalid tuples.


> old/inconsistent.  We deal with this by having the first WAL record that
> touches a given page after a checkpoint contain a copy of the whole
> page, and then we can rewrite the page from that copy instead of trying

I see, there is a problem with gistSplit: it can generate more than 3 pages
(three - is a limit of XLogInsert) in a case of bad user's picksplit method...
But it is a very "corner" case...

In practice, I don't see more than three pages, three pages are rarely generated
by gist__int_ops (contrib/intarray).

> BTW, there's another problem with gistContinueInsert: it's not making
> WAL entries for the actions it takes.  It needs to do so --- consider
> for example the PITR case, where the log will be shipped somewhere else
> and then followed verbatim.  So you've got to add WAL entries for any
> recovery actions you take after reading the existing WAL entries.

Ugh, I see

> process.  In particular, we really don't want to call any user-defined
> datatype functions inside the critical section.

Agree

> be to compute all the required changes *but not make them*, then start
> the critical section, then apply the changes and make the WAL record.
> This seemed like a large enough change that I figured I'd better talk
> with you about it.  One idea I had was to generate the WAL record as the
> output of the decision-making code, and then the critical section would
> use the WAL record as its guide to what to do to the buffers.

Hmm, we will think.


> BTW, I'm confused about gistadjscans: seems to me that's either broken
> or unnecessary.  Since we no longer hold an exclusive lock on a gist
> index while inserting into it, the comment at gistscan.c line 33 is
> certainly wrong now.  If the adjustment is still necessary then some
> other mechanism needs to be devised to get the information to other
> backends.  If it's not necessary then I think we should take it out.
> I'm not totally familiar with the gist logic, but it looks to me like
> gistadjscans is only called during an insert into a non-leaf page,
> which makes me think it might be unnecessary --- do gist index scans
> ever stop on non-leaf pages?

It seems to me - you are right. I missed that. I'll remove it and test changes.

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: pgsql: Improve gist XLOG code to follow the coding

From

Tom Lane

Date:

03 April 2006, 13:37:19

>> Hmm.  That probably needs to be redesigned then.

> Don't understand. Root will be fully regenerated and filled with invalid tuples.

Well, ISTM that if complete replay of the WAL sequence yields an invalid
index, then you've not got an adequate design for the WAL data.  Special
recovery actions really should only be needed if the WAL log is
incomplete (ie, it ends in the middle of a page-split sequence).

> I see, there is a problem with gistSplit: it can generate more than 3 pages
> (three - is a limit of XLogInsert) in a case of bad user's picksplit method...

I think we are OK on that, actually, because the page-split WAL record
is designed to rewrite all of the pages completely.  Only pages that are
going to be incrementally updated need to be exposed to XLogInsert.
So the root-page case in page-update is the only one that seems inadequate.

            regards, tom lane

Re: pgsql: Improve gist XLOG code to follow the coding

From

Teodor Sigaev

Date:

03 April 2006, 16:19:51

Tom Lane wrote:
>>> Hmm.  That probably needs to be redesigned then.
>
>> Don't understand. Root will be fully regenerated and filled with invalid tuples.
>
> Well, ISTM that if complete replay of the WAL sequence yields an invalid
> index, then you've not got an adequate design for the WAL data.  Special
> recovery actions really should only be needed if the WAL log is
> incomplete (ie, it ends in the middle of a page-split sequence).

Hm. The mentioned piece of code executes only for incomplete inserts.
It is a part of gistContinueInsert() called from  gist_xlog_cleanup()...

>> I see, there is a problem with gistSplit: it can generate more than 3 pages
>> (three - is a limit of XLogInsert) in a case of bad user's picksplit method...
>
> I think we are OK on that, actually, because the page-split WAL record
> is designed to rewrite all of the pages completely.  Only pages that are
> going to be incrementally updated need to be exposed to XLogInsert.
> So the root-page case in page-update is the only one that seems inadequate.

I was reading nbtree code and noticed, that new (just created) page fills "in
place", without START_CRIT_SECTION. As I understand, new page we can easy change
while it hasn't any link from other tree/pages.

gistSplit fills shadow (returned by PageGetTempPage()) of original page, so,
in that case, splitSplit can postpone PageRestoreTempPage() to caller. I can
make changes of gistSplit() on this week.

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/