Thread: Possibly too stringent Assert() in b-tree code
I've recently seen this when using 9.6: #0 0x00007f147892f0c7 in raise () from /lib64/libc.so.6 #1 0x00007f1478930478 in abort () from /lib64/libc.so.6 #2 0x00000000009683a1 in ExceptionalCondition (conditionName=0x9f2ef8 "!(((PageHeader) (page))->pd_special >= (__builtin_offsetof(PageHeaderData, pd_linp)))", errorType=0x9f2e8a "FailedAssertion", fileName=0x9f2e60 "../../../../src/include/storage/bufpage.h",lineNumber=314) at assert.c:54 #3 0x00000000004ee3d2 in PageValidateSpecialPointer (page=0x7f14700f3480 "") at ../../../../src/include/storage/bufpage.h:314 #4 0x00000000004ef913 in _bt_getbuf (rel=0x7f14795a1a50, blkno=11, access=2) at nbtpage.c:629 #5 0x00000000004eae8c in _bt_split (rel=0x7f14795a1a50, buf=136, cbuf=0, firstright=367, newitemoff=408, newitemsz=16, newitem=0x2608330,newitemonleft=0 '\000') at nbtinsert.c:986 #6 0x00000000004ea563 in _bt_insertonpg (rel=0x7f14795a1a50, buf=136, cbuf=0, stack=0x26090c8, itup=0x2608330, newitemoff=408,split_only_page=0 '\000') at nbtinsert.c:781 #7 0x00000000004e954c in _bt_doinsert (rel=0x7f14795a1a50, itup=0x2608330, checkUnique=UNIQUE_CHECK_YES, heapRel=0x25aa0a0)at nbtinsert.c:204 #8 0x00000000004f3b73 in btinsert (rel=0x7f14795a1a50, values=0x7ffe46c6f7f0, isnull=0x7ffe46c6f7d0 "", ht_ctid=0x260927c,heapRel=0x25aa0a0, checkUnique=UNIQUE_CHECK_YES) at nbtree.c:279 #9 0x00000000004e7964 in index_insert (indexRelation=0x7f14795a1a50, values=0x7ffe46c6f7f0, isnull=0x7ffe46c6f7d0 "", heap_t_ctid=0x260927c,heapRelation=0x25aa0a0, checkUnique=UNIQUE_CHECK_YES) at indexam.c:204 Of course, the database could have been corrupted after having encountered many crashes during my experiments. Neverthelesss, even without in-depth knowledge of the b-tree code I suspect that this stack trace might reflect a legal situation. In partcular, if _bt_page_recyclable() returned on this condition: if (PageIsNew(page)) return true; Unfortunately I no longer have the cluster nor the core dump, but I remember all the fields of PageHeader were indeed zeroed. -- Antonin Houska Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de, http://www.cybertec.at
On Mon, Sep 19, 2016 at 1:32 PM, Antonin Houska <ah@cybertec.at> wrote: > I've recently seen this when using 9.6: > > #0 0x00007f147892f0c7 in raise () from /lib64/libc.so.6 > #1 0x00007f1478930478 in abort () from /lib64/libc.so.6 > #2 0x00000000009683a1 in ExceptionalCondition (conditionName=0x9f2ef8 "!(((PageHeader) (page))->pd_special >= (__builtin_offsetof(PageHeaderData, pd_linp)))", errorType=0x9f2e8a "FailedAssertion", > fileName=0x9f2e60 "../../../../src/include/storage/bufpage.h", lineNumber=314) at assert.c:54 > #3 0x00000000004ee3d2 in PageValidateSpecialPointer (page=0x7f14700f3480 "") at ../../../../src/include/storage/bufpage.h:314 > #4 0x00000000004ef913 in _bt_getbuf (rel=0x7f14795a1a50, blkno=11, access=2) at nbtpage.c:629 > #5 0x00000000004eae8c in _bt_split (rel=0x7f14795a1a50, buf=136, cbuf=0, firstright=367, newitemoff=408, newitemsz=16,newitem=0x2608330, newitemonleft=0 '\000') at nbtinsert.c:986 > #6 0x00000000004ea563 in _bt_insertonpg (rel=0x7f14795a1a50, buf=136, cbuf=0, stack=0x26090c8, itup=0x2608330, newitemoff=408,split_only_page=0 '\000') at nbtinsert.c:781 > #7 0x00000000004e954c in _bt_doinsert (rel=0x7f14795a1a50, itup=0x2608330, checkUnique=UNIQUE_CHECK_YES, heapRel=0x25aa0a0)at nbtinsert.c:204 > #8 0x00000000004f3b73 in btinsert (rel=0x7f14795a1a50, values=0x7ffe46c6f7f0, isnull=0x7ffe46c6f7d0 "", ht_ctid=0x260927c,heapRel=0x25aa0a0, checkUnique=UNIQUE_CHECK_YES) at nbtree.c:279 > #9 0x00000000004e7964 in index_insert (indexRelation=0x7f14795a1a50, values=0x7ffe46c6f7f0, isnull=0x7ffe46c6f7d0 "",heap_t_ctid=0x260927c, heapRelation=0x25aa0a0, checkUnique=UNIQUE_CHECK_YES) > at indexam.c:204 > > Of course, the database could have been corrupted after having encountered > many crashes during my experiments. Neverthelesss, even without in-depth > knowledge of the b-tree code I suspect that this stack trace might reflect a > legal situation. In partcular, if _bt_page_recyclable() returned on this > condition: > > if (PageIsNew(page)) > return true; > I think you have a valid point. It seems we don't need to write WAL for reuse page (aka don't call _bt_log_reuse_page()), if the page is new, as the only purpose of that log is to handle conflict based on transaction id stored in special area which will be anyway zero. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Sep 19, 2016 at 7:07 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> Of course, the database could have been corrupted after having encountered >> many crashes during my experiments. Neverthelesss, even without in-depth >> knowledge of the b-tree code I suspect that this stack trace might reflect a >> legal situation. In partcular, if _bt_page_recyclable() returned on this >> condition: >> >> if (PageIsNew(page)) >> return true; >> > > I think you have a valid point. It seems we don't need to write WAL > for reuse page (aka don't call _bt_log_reuse_page()), if the page is > new, as the only purpose of that log is to handle conflict based on > transaction id stored in special area which will be anyway zero. +1. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Sep 19, 2016 at 7:07 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> I think you have a valid point. It seems we don't need to write WAL >> for reuse page (aka don't call _bt_log_reuse_page()), if the page is >> new, as the only purpose of that log is to handle conflict based on >> transaction id stored in special area which will be anyway zero. > +1. This is clearly an oversight in Simon's patch fafa374f2, which introduced this code without any consideration for the possibility that the page doesn't have a valid special area. We could prevent the crash by doing nothing if PageIsNew, a la if (_bt_page_recyclable(page)) { /* * If we are generatingWAL for Hot Standby then create a * WAL record that will allow us to conflict with queries * running on standby. */ - if (XLogStandbyInfoActive() && RelationNeedsWAL(rel)) + if (XLogStandbyInfoActive() && RelationNeedsWAL(rel) && + !PageIsNew(page)) { BTPageOpaque opaque = (BTPageOpaque)PageGetSpecialPointer(page); _bt_log_reuse_page(rel, blkno, opaque->btpo.xact); } /* Okay to use page. Re-initialize and return it */ but I'm not very clear on whether this is a safe fix, mainly because I don't understand what the reuse WAL record really accomplishes. Maybe we need to instead generate a reuse record with some special transaction ID indicating worst-case assumptions? regards, tom lane
On Fri, Sep 23, 2016 at 12:21 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Mon, Sep 19, 2016 at 7:07 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>> I think you have a valid point. It seems we don't need to write WAL >>> for reuse page (aka don't call _bt_log_reuse_page()), if the page is >>> new, as the only purpose of that log is to handle conflict based on >>> transaction id stored in special area which will be anyway zero. > >> +1. > > This is clearly an oversight in Simon's patch fafa374f2, which introduced > this code without any consideration for the possibility that the page > doesn't have a valid special area. We could prevent the crash by > doing nothing if PageIsNew, a la > > if (_bt_page_recyclable(page)) > { > /* > * If we are generating WAL for Hot Standby then create a > * WAL record that will allow us to conflict with queries > * running on standby. > */ > - if (XLogStandbyInfoActive() && RelationNeedsWAL(rel)) > + if (XLogStandbyInfoActive() && RelationNeedsWAL(rel) && > + !PageIsNew(page)) > { > BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page); > > _bt_log_reuse_page(rel, blkno, opaque->btpo.xact); > } > > /* Okay to use page. Re-initialize and return it */ > > but I'm not very clear on whether this is a safe fix, mainly because > I don't understand what the reuse WAL record really accomplishes. > Maybe we need to instead generate a reuse record with some special > transaction ID indicating worst-case assumptions? > I think it is basically to ensure that the queries on standby should not be considering the page being recycled as valid and if there is any pending query to which this page is visible, cancel it. If this understanding is correct, then I think the fix you are proposing is correct. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Amit Kapila <amit.kapila16@gmail.com> writes: > On Fri, Sep 23, 2016 at 12:21 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> This is clearly an oversight in Simon's patch fafa374f2, which introduced >> this code without any consideration for the possibility that the page >> doesn't have a valid special area. ... >> but I'm not very clear on whether this is a safe fix, mainly because >> I don't understand what the reuse WAL record really accomplishes. >> Maybe we need to instead generate a reuse record with some special >> transaction ID indicating worst-case assumptions? > I think it is basically to ensure that the queries on standby should > not be considering the page being recycled as valid and if there is > any pending query to which this page is visible, cancel it. If this > understanding is correct, then I think the fix you are proposing is > correct. After thinking about it some more, I do not understand why we are emitting WAL here at all in *any* case, either for a new page or for a deleted one that we're about to recycle. Why wouldn't the appropriate actions have been taken when the page was deleted, or even before that when its last tuple was deleted? In other words, if I'm left to fix it, I will do so by reverting fafa374f2 lock stock and barrel. I think it was ill-considered and unnecessary. regards, tom lane
On Sun, Sep 25, 2016 at 9:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Amit Kapila <amit.kapila16@gmail.com> writes: >> On Fri, Sep 23, 2016 at 12:21 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> This is clearly an oversight in Simon's patch fafa374f2, which introduced >>> this code without any consideration for the possibility that the page >>> doesn't have a valid special area. ... >>> but I'm not very clear on whether this is a safe fix, mainly because >>> I don't understand what the reuse WAL record really accomplishes. >>> Maybe we need to instead generate a reuse record with some special >>> transaction ID indicating worst-case assumptions? > >> I think it is basically to ensure that the queries on standby should >> not be considering the page being recycled as valid and if there is >> any pending query to which this page is visible, cancel it. If this >> understanding is correct, then I think the fix you are proposing is >> correct. > > After thinking about it some more, I do not understand why we are emitting > WAL here at all in *any* case, either for a new page or for a deleted one > that we're about to recycle. Why wouldn't the appropriate actions have > been taken when the page was deleted, or even before that when its last > tuple was deleted? > It seems to me that we do take actions for conflict resolution during the page deletion (that looks to be covered by XLOG_HEAP2_CLEANUP_INFO which we emit in vacuum), but not sure if that is sufficient. Consider a case where the new transaction is started on standby after page deletion and the same still precedes the value of xact on page, such transactions must be cancelled before we can reuse the page. I think the fact that before recycling the page we do ensure that no transaction is interested in that page in master indicates something similar is required in standby as well. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Sep 26, 2016 at 8:54 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Sun, Sep 25, 2016 at 9:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Amit Kapila <amit.kapila16@gmail.com> writes: >>> On Fri, Sep 23, 2016 at 12:21 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>>> This is clearly an oversight in Simon's patch fafa374f2, which introduced >>>> this code without any consideration for the possibility that the page >>>> doesn't have a valid special area. ... >>>> but I'm not very clear on whether this is a safe fix, mainly because >>>> I don't understand what the reuse WAL record really accomplishes. >>>> Maybe we need to instead generate a reuse record with some special >>>> transaction ID indicating worst-case assumptions? >> >>> I think it is basically to ensure that the queries on standby should >>> not be considering the page being recycled as valid and if there is >>> any pending query to which this page is visible, cancel it. If this >>> understanding is correct, then I think the fix you are proposing is >>> correct. >> >> After thinking about it some more, I do not understand why we are emitting >> WAL here at all in *any* case, either for a new page or for a deleted one >> that we're about to recycle. Why wouldn't the appropriate actions have >> been taken when the page was deleted, or even before that when its last >> tuple was deleted? >> > > It seems to me that we do take actions for conflict resolution during > the page deletion (that looks to be covered by XLOG_HEAP2_CLEANUP_INFO > which we emit in vacuum), but not sure if that is sufficient. > Consider a case where the new transaction is started on standby after > Here by new transaction, I intend to say some newer snapshot with valid MyPgXact->xmin. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Hello. I came from the following mail. https://www.postgresql.org/message-id/flat/0F97FA9ABBDBE54F91744A9B37151A5116E4DD@g01jpexmbkw24#8be5cc9b7c3cf454a82e561d84f4118f At Mon, 26 Sep 2016 09:12:04 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in <CAA4eK1K5YyDmndko0zzW6WNCN_DGFVHa6DCYcyuvkBWTH5+nUQ@mail.gmail.com> > On Mon, Sep 26, 2016 at 8:54 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Sun, Sep 25, 2016 at 9:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> Amit Kapila <amit.kapila16@gmail.com> writes: > >>> On Fri, Sep 23, 2016 at 12:21 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >>>> This is clearly an oversight in Simon's patch fafa374f2, which introduced > >>>> this code without any consideration for the possibility that the page > >>>> doesn't have a valid special area. ... > >>>> but I'm not very clear on whether this is a safe fix, mainly because > >>>> I don't understand what the reuse WAL record really accomplishes. > >>>> Maybe we need to instead generate a reuse record with some special > >>>> transaction ID indicating worst-case assumptions? > >> > >>> I think it is basically to ensure that the queries on standby should > >>> not be considering the page being recycled as valid and if there is > >>> any pending query to which this page is visible, cancel it. If this > >>> understanding is correct, then I think the fix you are proposing is > >>> correct. > >> > >> After thinking about it some more, I do not understand why we are emitting > >> WAL here at all in *any* case, either for a new page or for a deleted one > >> that we're about to recycle. Why wouldn't the appropriate actions have > >> been taken when the page was deleted, or even before that when its last > >> tuple was deleted? > >> > > > > It seems to me that we do take actions for conflict resolution during > > the page deletion (that looks to be covered by XLOG_HEAP2_CLEANUP_INFO > > which we emit in vacuum), but not sure if that is sufficient. > > Consider a case where the new transaction is started on standby after > > > > Here by new transaction, I intend to say some newer snapshot with > valid MyPgXact->xmin. I agree to the diagnosis. So the WAL record is not necessary if it is a new page since no one cannot be grabbing it. _bt_page_recyclable() has two callers. _bt_getbuf and btvacuumpage. A comment in btvacuumpage is saying that. > * _bt_checkpage(), which will barf on an all-zero page. We want to > * recycle all-zero pages, not fail. Also, we want to use a nondefault So it is right thing for _bt_page_recyclable() to return true for zeroed pages and it is consistent with the comment in _bt_page_recyclable(). At the problematic point, a new page is safe to recycle but there's no need to issue WAL record since it is not recycled with the meaning that _bt_log_resuse_page() thinking. > _bt_log_reuse_page(Relation rel, BlockNumber blkno, TransactionId latestRemovedXid) > { > xl_btree_reuse_page xlrec_reuse; > > /* > * Note that we don't register the buffer with the record, because this > * operation doesn't modify the page. This record only exists to provide a > * conflict point for Hot Standby. > */ I think we have all requred testimony from exiting code and comments. FWIW my conclustion is that Tom's solution upthread is right thing to do and needs addition in comment. regards. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c index a24e64156a..b39634cafd 100644 --- a/src/backend/access/nbtree/nbtpage.c +++ b/src/backend/access/nbtree/nbtpage.c @@ -782,9 +782,12 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access) /* * If we are generating WAL for Hot Standby then create a * WAL record that will allow us to conflict with queries - * running on standby. + * running on standby. No need to do that if the page is + * all-zoeroed since no one cannot be interested in the + * page on standbys. See _bt_page_recyclable(). */ - if (XLogStandbyInfoActive() && RelationNeedsWAL(rel)) + if (XLogStandbyInfoActive() && RelationNeedsWAL(rel) && + !PageIsNew()) { BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes: > At Mon, 26 Sep 2016 09:12:04 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in <CAA4eK1K5YyDmndko0zzW6WNCN_DGFVHa6DCYcyuvkBWTH5+nUQ@mail.gmail.com> >>> It seems to me that we do take actions for conflict resolution during >>> the page deletion (that looks to be covered by XLOG_HEAP2_CLEANUP_INFO >>> which we emit in vacuum), but not sure if that is sufficient. >>> Consider a case where the new transaction is started on standby after >> >> Here by new transaction, I intend to say some newer snapshot with >> valid MyPgXact->xmin. > I agree to the diagnosis. So the WAL record is not necessary if > it is a new page since no one cannot be grabbing it. Thanks for reviving this thread and reviewing the problem. I pushed the patch now with some more work on the comments. regards, tom lane