Thread: [sqlsmith] Failed assertion in _hash_kill_items/MarkBufferDirtyHint

[sqlsmith] Failed assertion in _hash_kill_items/MarkBufferDirtyHint

From
Andreas Seltenreich
Date:
Hi,

testing with master as of cf366e97ff, sqlsmith occasionally triggers the
following assertion:

TRAP: FailedAssertion("!(LWLockHeldByMe(((LWLock*) (&(bufHdr)->content_lock))))", File: "bufmgr.c", Line: 3397)

Backtraces always look like the one below.  It is reproducible on a
cluster once it happens.  I could provide a tarball if needed.

regards,
Andreas

#2  0x00000000008324b1 in ExceptionalCondition (conditionName=conditionName@entry=0x9e4e28 "!(LWLockHeldByMe(((LWLock*)
(&(bufHdr)->content_lock))))",errorType=errorType@entry=0x87b03d "FailedAssertion", fileName=fileName@entry=0x9e5856
"bufmgr.c",lineNumber=lineNumber@entry=3397) at assert.c:54
 
#3  0x0000000000706971 in MarkBufferDirtyHint (buffer=2844, buffer_std=buffer_std@entry=1 '\001') at bufmgr.c:3397
#4  0x00000000004b3ecd in _hash_kill_items (scan=scan@entry=0x66dcf70) at hashutil.c:514
#5  0x00000000004a9c1b in hashendscan (scan=0x66dcf70) at hash.c:512
#6  0x00000000004cf17a in index_endscan (scan=0x66dcf70) at indexam.c:353
#7  0x000000000061fa51 in ExecEndIndexScan (node=0x3093f30) at nodeIndexscan.c:852
#8  0x0000000000608e59 in ExecEndNode (node=<optimized out>) at execProcnode.c:715
#9  0x00000000006045b8 in ExecEndPlan (estate=0x3064000, planstate=<optimized out>) at execMain.c:1540
#10 standard_ExecutorEnd (queryDesc=0x30cb880) at execMain.c:487
#11 0x00000000005c87b0 in PortalCleanup (portal=0x1a60060) at portalcmds.c:302
#12 0x000000000085cbb3 in PortalDrop (portal=0x1a60060, isTopCommit=<optimized out>) at portalmem.c:489
#13 0x0000000000736ed2 in exec_simple_query (query_string=0x315b7a0 "...") at postgres.c:1111
#14 0x0000000000738b51 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x1a6c6c8, dbname=<optimized out>,
username=<optimizedout>) at postgres.c:4071
 
#15 0x0000000000475fef in BackendRun (port=0x1a65b90) at postmaster.c:4317
#16 BackendStartup (port=0x1a65b90) at postmaster.c:3989
#17 ServerLoop () at postmaster.c:1729
#18 0x00000000006c8662 in PostmasterMain (argc=argc@entry=4, argv=argv@entry=0x1a3f540) at postmaster.c:1337
#19 0x000000000047729d in main (argc=4, argv=0x1a3f540) at main.c:228



Re: [sqlsmith] Failed assertion in _hash_kill_items/MarkBufferDirtyHint

From
Ashutosh Sharma
Date:
Hi,

> testing with master as of cf366e97ff, sqlsmith occasionally triggers the
> following assertion:
>
> TRAP: FailedAssertion("!(LWLockHeldByMe(((LWLock*) (&(bufHdr)->content_lock))))", File: "bufmgr.c", Line: 3397)
>
> Backtraces always look like the one below.  It is reproducible on a
> cluster once it happens.  I could provide a tarball if needed.
>
> regards,
> Andreas
>
> #2  0x00000000008324b1 in ExceptionalCondition (conditionName=conditionName@entry=0x9e4e28
"!(LWLockHeldByMe(((LWLock*)(&(bufHdr)->content_lock))))", errorType=errorType@entry=0x87b03d "FailedAssertion",
fileName=fileName@entry=0x9e5856"bufmgr.c", lineNumber=lineNumber@entry=3397) at assert.c:54 
> #3  0x0000000000706971 in MarkBufferDirtyHint (buffer=2844, buffer_std=buffer_std@entry=1 '\001') at bufmgr.c:3397
> #4  0x00000000004b3ecd in _hash_kill_items (scan=scan@entry=0x66dcf70) at hashutil.c:514
> #5  0x00000000004a9c1b in hashendscan (scan=0x66dcf70) at hash.c:512
> #6  0x00000000004cf17a in index_endscan (scan=0x66dcf70) at indexam.c:353
> #7  0x000000000061fa51 in ExecEndIndexScan (node=0x3093f30) at nodeIndexscan.c:852
> #8  0x0000000000608e59 in ExecEndNode (node=<optimized out>) at execProcnode.c:715
> #9  0x00000000006045b8 in ExecEndPlan (estate=0x3064000, planstate=<optimized out>) at execMain.c:1540
> #10 standard_ExecutorEnd (queryDesc=0x30cb880) at execMain.c:487
> #11 0x00000000005c87b0 in PortalCleanup (portal=0x1a60060) at portalcmds.c:302
> #12 0x000000000085cbb3 in PortalDrop (portal=0x1a60060, isTopCommit=<optimized out>) at portalmem.c:489
> #13 0x0000000000736ed2 in exec_simple_query (query_string=0x315b7a0 "...") at postgres.c:1111
> #14 0x0000000000738b51 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x1a6c6c8, dbname=<optimized out>,
username=<optimizedout>) at postgres.c:4071 
> #15 0x0000000000475fef in BackendRun (port=0x1a65b90) at postmaster.c:4317
> #16 BackendStartup (port=0x1a65b90) at postmaster.c:3989
> #17 ServerLoop () at postmaster.c:1729
> #18 0x00000000006c8662 in PostmasterMain (argc=argc@entry=4, argv=argv@entry=0x1a3f540) at postmaster.c:1337
> #19 0x000000000047729d in main (argc=4, argv=0x1a3f540) at main.c:228
>


Hi,

Thanks for reporting this problem. Could you please let me know on for
how long did you run sqlsmith to get this crash. However, I have found
the reason for this crash. This is basically happening when trying to
retrieve the tuples using cursor. Basically the current hash index
scan work tuple-at-a-time which means once it finds tuple on page, it
releases lock from the page but keeps pin on it and finally returns
the tuple. When the requested number of tuples are processed there is
no lock on the page that was being scanned but yes there is a pin on
it. Finally, when trying to close a cursor at the end of scan, if any
killed tuples has been identified we try to first mark these items as
dead with the help of _hash_kill_items(). But, since we only have pin
on this page, the assert check 'LWLockHeldByMe()' fails.

When scanning tuples using normal SELECT * statement, before moving to
next page in a bucket we first deal with all the killed items but we
do this without releasing lock and pin on the current page. Hence,
with SELECT queries this crash is not visible.

The attached patch fixes this. But, please note that all these changes
will get removed with the patch for page scan mode - [1].

[1] - https://www.postgresql.org/message-id/CA%2BTgmobYTvexcjqMhXoNCyEUHChzmdC_2xVGgj7eqaYVgoJA%2Bg%40mail.gmail.com

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

Attachment

Re: [sqlsmith] Failed assertion in _hash_kill_items/MarkBufferDirtyHint

From
Andreas Seltenreich
Date:
Ashutosh Sharma writes:
>> TRAP: FailedAssertion("!(LWLockHeldByMe(((LWLock*) (&(bufHdr)->content_lock))))", File: "bufmgr.c", Line: 3397)
> Thanks for reporting this problem. Could you please let me know on for
> how long did you run sqlsmith to get this crash.

I got about one TRAP per hour when testing on 20 nodes with one postgres
and 5 sqlsmithes on each.  Nodes are tiny consumer machines with
low-power 4-core sandy bridges.

> [2. reacquire_lock_hashkillitems_if_required.patch]

I'll test with your patch applied as soon as time permits and report
back.

regards,
Andreas



Re: [sqlsmith] Failed assertion in _hash_kill_items/MarkBufferDirtyHint

From
Ashutosh Sharma
Date:
On Mar 28, 2017 00:00, "Andreas Seltenreich" <seltenreich@gmx.de> wrote:
Ashutosh Sharma writes:
>> TRAP: FailedAssertion("!(LWLockHeldByMe(((LWLock*) (&(bufHdr)->content_lock))))", File: "bufmgr.c", Line: 3397)
> Thanks for reporting this problem. Could you please let me know on for
> how long did you run sqlsmith to get this crash.

I got about one TRAP per hour when testing on 20 nodes with one postgres
and 5 sqlsmithes on each.  Nodes are tiny consumer machines with
low-power 4-core sandy bridges.

Okay. Thanks for sharing this information. I tried running running sqlsmith on a single node. I ran it for an hour but didn't find any crash. Let me also try running multiple sqlsmith at a time...May be 5 like you are doing on a single node.


> [2. reacquire_lock_hashkillitems_if_required.patch]

I'll test with your patch applied as soon as time permits and report
back.

Sure, Thanks a lot.

With Regards,
Ashutosh Sharma

Re: [sqlsmith] Failed assertion in _hash_kill_items/MarkBufferDirtyHint

From
Robert Haas
Date:
On Mon, Mar 27, 2017 at 5:39 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> Thanks for reporting this problem. Could you please let me know on for
> how long did you run sqlsmith to get this crash. However, I have found
> the reason for this crash. This is basically happening when trying to
> retrieve the tuples using cursor. Basically the current hash index
> scan work tuple-at-a-time which means once it finds tuple on page, it
> releases lock from the page but keeps pin on it and finally returns
> the tuple. When the requested number of tuples are processed there is
> no lock on the page that was being scanned but yes there is a pin on
> it. Finally, when trying to close a cursor at the end of scan, if any
> killed tuples has been identified we try to first mark these items as
> dead with the help of _hash_kill_items(). But, since we only have pin
> on this page, the assert check 'LWLockHeldByMe()' fails.

Instead of adding bool haveLock to _hash_kill_items, how about just
having the caller acquire the lock before calling the function and
release it afterwards?  Since the acquire is at the beginning of the
function and the release at the end, there seems to be no point in
putting the acquire/release inside the function rather than in the
caller.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [sqlsmith] Failed assertion in _hash_kill_items/MarkBufferDirtyHint

From
Ashutosh Sharma
Date:
On Sat, Apr 1, 2017 at 1:08 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Mar 27, 2017 at 5:39 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>> Thanks for reporting this problem. Could you please let me know on for
>> how long did you run sqlsmith to get this crash. However, I have found
>> the reason for this crash. This is basically happening when trying to
>> retrieve the tuples using cursor. Basically the current hash index
>> scan work tuple-at-a-time which means once it finds tuple on page, it
>> releases lock from the page but keeps pin on it and finally returns
>> the tuple. When the requested number of tuples are processed there is
>> no lock on the page that was being scanned but yes there is a pin on
>> it. Finally, when trying to close a cursor at the end of scan, if any
>> killed tuples has been identified we try to first mark these items as
>> dead with the help of _hash_kill_items(). But, since we only have pin
>> on this page, the assert check 'LWLockHeldByMe()' fails.
>
> Instead of adding bool haveLock to _hash_kill_items, how about just
> having the caller acquire the lock before calling the function and
> release it afterwards?  Since the acquire is at the beginning of the
> function and the release at the end, there seems to be no point in
> putting the acquire/release inside the function rather than in the
> caller.

Well, That is another option but the main idea was to be inline with
the btree code. Moreover, I am not sure if acquiring lwlock inside
hashendscan (basically the place where we are trying to close down the
things) would look good.

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com



Re: [sqlsmith] Failed assertion in _hash_kill_items/MarkBufferDirtyHint

From
Robert Haas
Date:
On Fri, Mar 31, 2017 at 6:09 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> Well, That is another option but the main idea was to be inline with
> the btree code.

That's not a bad goal in principal, but _bt_killitems() doesn't have
any similar argument.

(Also, speaking of consistency, why did we end up with
_hash_kill_items, with an underscore between kill and items, and
_bt_killitems, without one?)

> Moreover, I am not sure if acquiring lwlock inside
> hashendscan (basically the place where we are trying to close down the
> things) would look good.

Well, if that's not a good thing to do, hiding it inside some other
function doesn't make it better.  I think it's fine, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [sqlsmith] Failed assertion in _hash_kill_items/MarkBufferDirtyHint

From
Ashutosh Sharma
Date:
On Sat, Apr 1, 2017 at 6:00 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Mar 31, 2017 at 6:09 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>> Well, That is another option but the main idea was to be inline with
>> the btree code.
>
> That's not a bad goal in principal, but _bt_killitems() doesn't have
> any similar argument.

It was there but later got removed with some patch (may be the patch
for reducing pinning and buffer content lock for btree scans). The
following commit has this changes.

commit 09cb5c0e7d6fbc9dee26dc429e4fc0f2a88e5272
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Sun May 7 01:21:30 2006 +0000

>
> (Also, speaking of consistency, why did we end up with
> _hash_kill_items, with an underscore between kill and items, and
> _bt_killitems, without one?)

That is just to follow the naming convention in hash.h Please check
the review comments for this at - [1].

>
>> Moreover, I am not sure if acquiring lwlock inside
>> hashendscan (basically the place where we are trying to close down the
>> things) would look good.
>
> Well, if that's not a good thing to do, hiding it inside some other
> function doesn't make it better.

okay, agreed. I will submit the patch very shortly.


[1] - https://www.postgresql.org/message-id/cf6abd8a-77b5-f1c7-8e50-5bef461e0522%40redhat.com



Re: [sqlsmith] Failed assertion in _hash_kill_items/MarkBufferDirtyHint

From
Ashutosh Sharma
Date:
Hi,

On Sat, Apr 1, 2017 at 7:06 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> On Sat, Apr 1, 2017 at 6:00 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Fri, Mar 31, 2017 at 6:09 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>>> Well, That is another option but the main idea was to be inline with
>>> the btree code.
>>
>> That's not a bad goal in principal, but _bt_killitems() doesn't have
>> any similar argument.
>
> It was there but later got removed with some patch (may be the patch
> for reducing pinning and buffer content lock for btree scans). The
> following commit has this changes.
>
> commit 09cb5c0e7d6fbc9dee26dc429e4fc0f2a88e5272
> Author: Tom Lane <tgl@sss.pgh.pa.us>
> Date:   Sun May 7 01:21:30 2006 +0000
>
>>
>> (Also, speaking of consistency, why did we end up with
>> _hash_kill_items, with an underscore between kill and items, and
>> _bt_killitems, without one?)
>
> That is just to follow the naming convention in hash.h Please check
> the review comments for this at - [1].
>
>>
>>> Moreover, I am not sure if acquiring lwlock inside
>>> hashendscan (basically the place where we are trying to close down the
>>> things) would look good.
>>
>> Well, if that's not a good thing to do, hiding it inside some other
>> function doesn't make it better.
>
> okay, agreed. I will submit the patch very shortly.

As suggested, I am now acquiring lock inside the caller function.
Attached is the patch having this changes. Thanks.

>
>
> [1] - https://www.postgresql.org/message-id/cf6abd8a-77b5-f1c7-8e50-5bef461e0522%40redhat.com

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

Attachment

Re: [sqlsmith] Failed assertion in _hash_kill_items/MarkBufferDirtyHint

From
Andreas Seltenreich
Date:
Andreas Seltenreich writes:
>>> TRAP: FailedAssertion("!(LWLockHeldByMe(((LWLock*) (&(bufHdr)->content_lock))))", File: "bufmgr.c", Line: 3397)
> I got about one TRAP per hour when testing on 20 nodes with one postgres
> and 5 sqlsmithes on each.

> Ashutosh Sharma writes:
>> [2. reacquire_lock_hashkillitems_if_required.patch]

I've done about 12 hours of testing since applying this patch and no
failed assertions so far.

regards,
Andreas



Re: [sqlsmith] Failed assertion in _hash_kill_items/MarkBufferDirtyHint

From
Ashutosh Sharma
Date:
Hi Andreas,

On Apr 1, 2017 16:15, "Andreas Seltenreich" <seltenreich@gmx.de> wrote:
Andreas Seltenreich writes:
>>> TRAP: FailedAssertion("!(LWLockHeldByMe(((LWLock*) (&(bufHdr)->content_lock))))", File: "bufmgr.c", Line: 3397)
> I got about one TRAP per hour when testing on 20 nodes with one postgres
> and 5 sqlsmithes on each.

> Ashutosh Sharma writes:
>> [2. reacquire_lock_hashkillitems_if_required.patch]

I've done about 12 hours of testing since applying this patch and no
failed assertions so far.


Thanks for testing my patch. Just FYI, I have slightly changed my patch this morning as per Robert's suggestions. But, still more or less the logic remains the same. Thank you once again.

Re: [sqlsmith] Failed assertion in _hash_kill_items/MarkBufferDirtyHint

From
Amit Kapila
Date:
On Sat, Apr 1, 2017 at 6:00 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Mar 31, 2017 at 6:09 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>> Well, That is another option but the main idea was to be inline with
>> the btree code.
>
> That's not a bad goal in principal, but _bt_killitems() doesn't have
> any similar argument.
>
> (Also, speaking of consistency, why did we end up with
> _hash_kill_items, with an underscore between kill and items, and
> _bt_killitems, without one?)
>
>> Moreover, I am not sure if acquiring lwlock inside
>> hashendscan (basically the place where we are trying to close down the
>> things) would look good.
>
> Well, if that's not a good thing to do, hiding it inside some other
> function doesn't make it better.  I think it's fine, though.
>

One thing to note here is that this fix won't be required if we get
the page-scan-mode patch [1] in this CF.  I think if we fix this with
the patch proposed by Ashutosh, then we anyway needs to again change
the related code (kind of revert the fix) after page-scan-mode patch.
Now, if you think we have negligible chance of getting that patch,
then it is good to proceed with this fix.

[1] - https://commitfest.postgresql.org/13/999/

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [sqlsmith] Failed assertion in _hash_kill_items/MarkBufferDirtyHint

From
Ashutosh Sharma
Date:
On Apr 1, 2017 18:11, "Amit Kapila" <amit.kapila16@gmail.com> wrote:
On Sat, Apr 1, 2017 at 6:00 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Mar 31, 2017 at 6:09 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>> Well, That is another option but the main idea was to be inline with
>> the btree code.
>
> That's not a bad goal in principal, but _bt_killitems() doesn't have
> any similar argument.
>
> (Also, speaking of consistency, why did we end up with
> _hash_kill_items, with an underscore between kill and items, and
> _bt_killitems, without one?)
>
>> Moreover, I am not sure if acquiring lwlock inside
>> hashendscan (basically the place where we are trying to close down the
>> things) would look good.
>
> Well, if that's not a good thing to do, hiding it inside some other
> function doesn't make it better.  I think it's fine, though.
>

One thing to note here is that this fix won't be required if we get
the page-scan-mode patch [1] in this CF.  I think if we fix this with
the patch proposed by Ashutosh, then we anyway needs to again change
the related code (kind of revert the fix) after page-scan-mode patch.
Now, if you think we have negligible chance of getting that patch,
then it is good to proceed with this fix.

Yes, I had already mentioned this point in my very first mail. Thanks for highlighting this once again.

Re: [sqlsmith] Failed assertion in _hash_kill_items/MarkBufferDirtyHint

From
Robert Haas
Date:
On Fri, Mar 31, 2017 at 10:02 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> As suggested, I am now acquiring lock inside the caller function.
> Attached is the patch having this changes. Thanks.

Committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company