Thread: Possibly too stringent Assert() in b-tree code

Possibly too stringent Assert() in b-tree code

From

Antonin Houska

Date:

19 September 2016, 08:02:54

I've recently seen this when using 9.6:

#0  0x00007f147892f0c7 in raise () from /lib64/libc.so.6
#1  0x00007f1478930478 in abort () from /lib64/libc.so.6
#2  0x00000000009683a1 in ExceptionalCondition (conditionName=0x9f2ef8 "!(((PageHeader) (page))->pd_special >=
(__builtin_offsetof(PageHeaderData, pd_linp)))", errorType=0x9f2e8a "FailedAssertion",   fileName=0x9f2e60
"../../../../src/include/storage/bufpage.h",lineNumber=314) at assert.c:54 
#3  0x00000000004ee3d2 in PageValidateSpecialPointer (page=0x7f14700f3480 "") at
../../../../src/include/storage/bufpage.h:314
#4  0x00000000004ef913 in _bt_getbuf (rel=0x7f14795a1a50, blkno=11, access=2) at nbtpage.c:629
#5  0x00000000004eae8c in _bt_split (rel=0x7f14795a1a50, buf=136, cbuf=0, firstright=367, newitemoff=408, newitemsz=16,
newitem=0x2608330,newitemonleft=0 '\000') at nbtinsert.c:986 
#6  0x00000000004ea563 in _bt_insertonpg (rel=0x7f14795a1a50, buf=136, cbuf=0, stack=0x26090c8, itup=0x2608330,
newitemoff=408,split_only_page=0 '\000') at nbtinsert.c:781 
#7  0x00000000004e954c in _bt_doinsert (rel=0x7f14795a1a50, itup=0x2608330, checkUnique=UNIQUE_CHECK_YES,
heapRel=0x25aa0a0)at nbtinsert.c:204 
#8  0x00000000004f3b73 in btinsert (rel=0x7f14795a1a50, values=0x7ffe46c6f7f0, isnull=0x7ffe46c6f7d0 "",
ht_ctid=0x260927c,heapRel=0x25aa0a0, checkUnique=UNIQUE_CHECK_YES) at nbtree.c:279 
#9  0x00000000004e7964 in index_insert (indexRelation=0x7f14795a1a50, values=0x7ffe46c6f7f0, isnull=0x7ffe46c6f7d0 "",
heap_t_ctid=0x260927c,heapRelation=0x25aa0a0, checkUnique=UNIQUE_CHECK_YES)   at indexam.c:204 

Of course, the database could have been corrupted after having encountered
many crashes during my experiments. Neverthelesss, even without in-depth
knowledge of the b-tree code I suspect that this stack trace might reflect a
legal situation. In partcular, if _bt_page_recyclable() returned on this
condition:
if (PageIsNew(page))    return true;


Unfortunately I no longer have the cluster nor the core dump, but I remember
all the fields of PageHeader were indeed zeroed.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at

Re: Possibly too stringent Assert() in b-tree code

From

Amit Kapila

Date:

19 September 2016, 11:07:39

On Mon, Sep 19, 2016 at 1:32 PM, Antonin Houska <ah@cybertec.at> wrote:
> I've recently seen this when using 9.6:
>
> #0  0x00007f147892f0c7 in raise () from /lib64/libc.so.6
> #1  0x00007f1478930478 in abort () from /lib64/libc.so.6
> #2  0x00000000009683a1 in ExceptionalCondition (conditionName=0x9f2ef8 "!(((PageHeader) (page))->pd_special >=
(__builtin_offsetof(PageHeaderData, pd_linp)))", errorType=0x9f2e8a "FailedAssertion",
 
>     fileName=0x9f2e60 "../../../../src/include/storage/bufpage.h", lineNumber=314) at assert.c:54
> #3  0x00000000004ee3d2 in PageValidateSpecialPointer (page=0x7f14700f3480 "") at
../../../../src/include/storage/bufpage.h:314
> #4  0x00000000004ef913 in _bt_getbuf (rel=0x7f14795a1a50, blkno=11, access=2) at nbtpage.c:629
> #5  0x00000000004eae8c in _bt_split (rel=0x7f14795a1a50, buf=136, cbuf=0, firstright=367, newitemoff=408,
newitemsz=16,newitem=0x2608330, newitemonleft=0 '\000') at nbtinsert.c:986
 
> #6  0x00000000004ea563 in _bt_insertonpg (rel=0x7f14795a1a50, buf=136, cbuf=0, stack=0x26090c8, itup=0x2608330,
newitemoff=408,split_only_page=0 '\000') at nbtinsert.c:781
 
> #7  0x00000000004e954c in _bt_doinsert (rel=0x7f14795a1a50, itup=0x2608330, checkUnique=UNIQUE_CHECK_YES,
heapRel=0x25aa0a0)at nbtinsert.c:204
 
> #8  0x00000000004f3b73 in btinsert (rel=0x7f14795a1a50, values=0x7ffe46c6f7f0, isnull=0x7ffe46c6f7d0 "",
ht_ctid=0x260927c,heapRel=0x25aa0a0, checkUnique=UNIQUE_CHECK_YES) at nbtree.c:279
 
> #9  0x00000000004e7964 in index_insert (indexRelation=0x7f14795a1a50, values=0x7ffe46c6f7f0, isnull=0x7ffe46c6f7d0
"",heap_t_ctid=0x260927c, heapRelation=0x25aa0a0, checkUnique=UNIQUE_CHECK_YES)
 
>     at indexam.c:204
>
> Of course, the database could have been corrupted after having encountered
> many crashes during my experiments. Neverthelesss, even without in-depth
> knowledge of the b-tree code I suspect that this stack trace might reflect a
> legal situation. In partcular, if _bt_page_recyclable() returned on this
> condition:
>
>         if (PageIsNew(page))
>                 return true;
>

I think you have a valid point.  It seems we don't need to write WAL
for reuse page (aka don't call _bt_log_reuse_page()), if the page is
new, as the only purpose of that log is to handle conflict based on
transaction id stored in special area which will be anyway zero.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Possibly too stringent Assert() in b-tree code

From

Robert Haas

Date:

19 September 2016, 12:41:30

On Mon, Sep 19, 2016 at 7:07 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Of course, the database could have been corrupted after having encountered
>> many crashes during my experiments. Neverthelesss, even without in-depth
>> knowledge of the b-tree code I suspect that this stack trace might reflect a
>> legal situation. In partcular, if _bt_page_recyclable() returned on this
>> condition:
>>
>>         if (PageIsNew(page))
>>                 return true;
>>
>
> I think you have a valid point.  It seems we don't need to write WAL
> for reuse page (aka don't call _bt_log_reuse_page()), if the page is
> new, as the only purpose of that log is to handle conflict based on
> transaction id stored in special area which will be anyway zero.

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Possibly too stringent Assert() in b-tree code

From

Tom Lane

Date:

22 September 2016, 18:51:26

Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Sep 19, 2016 at 7:07 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I think you have a valid point.  It seems we don't need to write WAL
>> for reuse page (aka don't call _bt_log_reuse_page()), if the page is
>> new, as the only purpose of that log is to handle conflict based on
>> transaction id stored in special area which will be anyway zero.

> +1.

This is clearly an oversight in Simon's patch fafa374f2, which introduced
this code without any consideration for the possibility that the page
doesn't have a valid special area.  We could prevent the crash by
doing nothing if PageIsNew, a la
               if (_bt_page_recyclable(page))               {                   /*                    * If we are
generatingWAL for Hot Standby then create a                    * WAL record that will allow us to conflict with queries
                  * running on standby.                    */
 
-                   if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
+                   if (XLogStandbyInfoActive() && RelationNeedsWAL(rel) &&
+                       !PageIsNew(page))                   {                       BTPageOpaque opaque =
(BTPageOpaque)PageGetSpecialPointer(page);
 
                       _bt_log_reuse_page(rel, blkno, opaque->btpo.xact);                   }
                   /* Okay to use page.  Re-initialize and return it */

but I'm not very clear on whether this is a safe fix, mainly because
I don't understand what the reuse WAL record really accomplishes.
Maybe we need to instead generate a reuse record with some special
transaction ID indicating worst-case assumptions?
        regards, tom lane

Re: Possibly too stringent Assert() in b-tree code

From

Amit Kapila

Date:

24 September 2016, 13:06:56

On Fri, Sep 23, 2016 at 12:21 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Mon, Sep 19, 2016 at 7:07 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> I think you have a valid point.  It seems we don't need to write WAL
>>> for reuse page (aka don't call _bt_log_reuse_page()), if the page is
>>> new, as the only purpose of that log is to handle conflict based on
>>> transaction id stored in special area which will be anyway zero.
>
>> +1.
>
> This is clearly an oversight in Simon's patch fafa374f2, which introduced
> this code without any consideration for the possibility that the page
> doesn't have a valid special area.  We could prevent the crash by
> doing nothing if PageIsNew, a la
>
>                 if (_bt_page_recyclable(page))
>                 {
>                     /*
>                      * If we are generating WAL for Hot Standby then create a
>                      * WAL record that will allow us to conflict with queries
>                      * running on standby.
>                      */
> -                   if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
> +                   if (XLogStandbyInfoActive() && RelationNeedsWAL(rel) &&
> +                       !PageIsNew(page))
>                     {
>                         BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
>
>                         _bt_log_reuse_page(rel, blkno, opaque->btpo.xact);
>                     }
>
>                     /* Okay to use page.  Re-initialize and return it */
>
> but I'm not very clear on whether this is a safe fix, mainly because
> I don't understand what the reuse WAL record really accomplishes.
> Maybe we need to instead generate a reuse record with some special
> transaction ID indicating worst-case assumptions?
>

I think it is basically to ensure that the queries on standby should
not be considering the page being recycled as valid and if there is
any pending query to which this page is visible, cancel it.  If this
understanding is correct, then I think the fix you are proposing is
correct.



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Possibly too stringent Assert() in b-tree code

From

Tom Lane

Date:

25 September 2016, 16:01:13

Amit Kapila <amit.kapila16@gmail.com> writes:
> On Fri, Sep 23, 2016 at 12:21 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> This is clearly an oversight in Simon's patch fafa374f2, which introduced
>> this code without any consideration for the possibility that the page
>> doesn't have a valid special area. ...
>> but I'm not very clear on whether this is a safe fix, mainly because
>> I don't understand what the reuse WAL record really accomplishes.
>> Maybe we need to instead generate a reuse record with some special
>> transaction ID indicating worst-case assumptions?

> I think it is basically to ensure that the queries on standby should
> not be considering the page being recycled as valid and if there is
> any pending query to which this page is visible, cancel it.  If this
> understanding is correct, then I think the fix you are proposing is
> correct.

After thinking about it some more, I do not understand why we are emitting
WAL here at all in *any* case, either for a new page or for a deleted one
that we're about to recycle.  Why wouldn't the appropriate actions have
been taken when the page was deleted, or even before that when its last
tuple was deleted?

In other words, if I'm left to fix it, I will do so by reverting
fafa374f2 lock stock and barrel.  I think it was ill-considered and
unnecessary.
        regards, tom lane

Re: Possibly too stringent Assert() in b-tree code

From

Amit Kapila

Date:

26 September 2016, 03:25:05

On Sun, Sep 25, 2016 at 9:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Kapila <amit.kapila16@gmail.com> writes:
>> On Fri, Sep 23, 2016 at 12:21 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> This is clearly an oversight in Simon's patch fafa374f2, which introduced
>>> this code without any consideration for the possibility that the page
>>> doesn't have a valid special area. ...
>>> but I'm not very clear on whether this is a safe fix, mainly because
>>> I don't understand what the reuse WAL record really accomplishes.
>>> Maybe we need to instead generate a reuse record with some special
>>> transaction ID indicating worst-case assumptions?
>
>> I think it is basically to ensure that the queries on standby should
>> not be considering the page being recycled as valid and if there is
>> any pending query to which this page is visible, cancel it.  If this
>> understanding is correct, then I think the fix you are proposing is
>> correct.
>
> After thinking about it some more, I do not understand why we are emitting
> WAL here at all in *any* case, either for a new page or for a deleted one
> that we're about to recycle.  Why wouldn't the appropriate actions have
> been taken when the page was deleted, or even before that when its last
> tuple was deleted?
>

It seems to me that we do take actions for conflict resolution during
the page deletion (that looks to be covered by XLOG_HEAP2_CLEANUP_INFO
which we emit in vacuum), but not sure if that is sufficient.
Consider a case where the new transaction is started on standby after
page deletion and the same still precedes the value of xact on page,
such transactions must be cancelled before we can reuse the page.  I
think the fact that before recycling the page we do ensure that no
transaction is interested in that page in master indicates something
similar is required in standby as well.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Possibly too stringent Assert() in b-tree code

From

Amit Kapila

Date:

26 September 2016, 03:42:16

On Mon, Sep 26, 2016 at 8:54 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sun, Sep 25, 2016 at 9:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Amit Kapila <amit.kapila16@gmail.com> writes:
>>> On Fri, Sep 23, 2016 at 12:21 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> This is clearly an oversight in Simon's patch fafa374f2, which introduced
>>>> this code without any consideration for the possibility that the page
>>>> doesn't have a valid special area. ...
>>>> but I'm not very clear on whether this is a safe fix, mainly because
>>>> I don't understand what the reuse WAL record really accomplishes.
>>>> Maybe we need to instead generate a reuse record with some special
>>>> transaction ID indicating worst-case assumptions?
>>
>>> I think it is basically to ensure that the queries on standby should
>>> not be considering the page being recycled as valid and if there is
>>> any pending query to which this page is visible, cancel it.  If this
>>> understanding is correct, then I think the fix you are proposing is
>>> correct.
>>
>> After thinking about it some more, I do not understand why we are emitting
>> WAL here at all in *any* case, either for a new page or for a deleted one
>> that we're about to recycle.  Why wouldn't the appropriate actions have
>> been taken when the page was deleted, or even before that when its last
>> tuple was deleted?
>>
>
> It seems to me that we do take actions for conflict resolution during
> the page deletion (that looks to be covered by XLOG_HEAP2_CLEANUP_INFO
> which we emit in vacuum), but not sure if that is sufficient.
> Consider a case where the new transaction is started on standby after
>

Here by new transaction, I intend to say some newer snapshot with
valid MyPgXact->xmin.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: [HACKERS] Possibly too stringent Assert() in b-tree code

From

Kyotaro HORIGUCHI

Date:

26 June 2018, 12:47:48

Hello. I came from the following mail.


https://www.postgresql.org/message-id/flat/0F97FA9ABBDBE54F91744A9B37151A5116E4DD@g01jpexmbkw24#8be5cc9b7c3cf454a82e561d84f4118f

At Mon, 26 Sep 2016 09:12:04 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
<CAA4eK1K5YyDmndko0zzW6WNCN_DGFVHa6DCYcyuvkBWTH5+nUQ@mail.gmail.com>
> On Mon, Sep 26, 2016 at 8:54 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Sun, Sep 25, 2016 at 9:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Amit Kapila <amit.kapila16@gmail.com> writes:
> >>> On Fri, Sep 23, 2016 at 12:21 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >>>> This is clearly an oversight in Simon's patch fafa374f2, which introduced
> >>>> this code without any consideration for the possibility that the page
> >>>> doesn't have a valid special area. ...
> >>>> but I'm not very clear on whether this is a safe fix, mainly because
> >>>> I don't understand what the reuse WAL record really accomplishes.
> >>>> Maybe we need to instead generate a reuse record with some special
> >>>> transaction ID indicating worst-case assumptions?
> >>
> >>> I think it is basically to ensure that the queries on standby should
> >>> not be considering the page being recycled as valid and if there is
> >>> any pending query to which this page is visible, cancel it.  If this
> >>> understanding is correct, then I think the fix you are proposing is
> >>> correct.
> >>
> >> After thinking about it some more, I do not understand why we are emitting
> >> WAL here at all in *any* case, either for a new page or for a deleted one
> >> that we're about to recycle.  Why wouldn't the appropriate actions have
> >> been taken when the page was deleted, or even before that when its last
> >> tuple was deleted?
> >>
> >
> > It seems to me that we do take actions for conflict resolution during
> > the page deletion (that looks to be covered by XLOG_HEAP2_CLEANUP_INFO
> > which we emit in vacuum), but not sure if that is sufficient.
> > Consider a case where the new transaction is started on standby after
> >
> 
> Here by new transaction, I intend to say some newer snapshot with
> valid MyPgXact->xmin.

I agree to the diagnosis. So the WAL record is not necessary if
it is a new page since no one cannot be grabbing it.

_bt_page_recyclable() has two callers. _bt_getbuf and
btvacuumpage. A comment in btvacuumpage is saying that.

>      * _bt_checkpage(), which will barf on an all-zero page. We want to
>      * recycle all-zero pages, not fail.  Also, we want to use a nondefault

So it is right thing for _bt_page_recyclable() to return true for
zeroed pages and it is consistent with the comment in
_bt_page_recyclable().

At the problematic point, a new page is safe to recycle but
there's no need to issue WAL record since it is not recycled with
the meaning that _bt_log_resuse_page() thinking.

> _bt_log_reuse_page(Relation rel, BlockNumber blkno, TransactionId latestRemovedXid)
> {
>   xl_btree_reuse_page xlrec_reuse;
> 
>   /*
>    * Note that we don't register the buffer with the record, because this
>    * operation doesn't modify the page. This record only exists to provide a
>    * conflict point for Hot Standby.
>    */

I think we have all requred testimony from exiting code and
comments. FWIW my conclustion is that Tom's solution upthread is
right thing to do and needs addition in comment.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index a24e64156a..b39634cafd 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -782,9 +782,12 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
                     /*
                      * If we are generating WAL for Hot Standby then create a
                      * WAL record that will allow us to conflict with queries
-                     * running on standby.
+                     * running on standby. No need to do that if the page is
+                     * all-zoeroed since no one cannot be interested in the
+                     * page on standbys. See _bt_page_recyclable().
                      */
-                    if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
+                    if (XLogStandbyInfoActive() && RelationNeedsWAL(rel) &&
+                        !PageIsNew())
                     {
                         BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);

Re: [HACKERS] Possibly too stringent Assert() in b-tree code

From

Tom Lane

Date:

10 July 2018, 02:29:42

Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
> At Mon, 26 Sep 2016 09:12:04 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
<CAA4eK1K5YyDmndko0zzW6WNCN_DGFVHa6DCYcyuvkBWTH5+nUQ@mail.gmail.com>
>>> It seems to me that we do take actions for conflict resolution during
>>> the page deletion (that looks to be covered by XLOG_HEAP2_CLEANUP_INFO
>>> which we emit in vacuum), but not sure if that is sufficient.
>>> Consider a case where the new transaction is started on standby after
>>
>> Here by new transaction, I intend to say some newer snapshot with
>> valid MyPgXact->xmin.

> I agree to the diagnosis. So the WAL record is not necessary if
> it is a new page since no one cannot be grabbing it.

Thanks for reviving this thread and reviewing the problem.
I pushed the patch now with some more work on the comments.

            regards, tom lane