Thread: MarkBufferDirty Assert held LW_EXCLUSIVE lock fail when ginFinishSplit

MarkBufferDirty Assert held LW_EXCLUSIVE lock fail when ginFinishSplit

From
"feichanghong"
Date:
Hi,

Environmental information is as follows:
PostgreSQL Version: 17, branch: master, commit: abb0b4fc
Operating system: centos 7, Kernel version: 5.10.13
Client: psql

After the instance crash recovery, I encountered an assertion failure with
MarkBufferDirty when inserting data into a gin index. The specific stack trace
is as follows:
```
#0  0x00007feb10d19277 in raise () from /lib64/libc.so.6
#1  0x00007feb10d1a968 in abort () from /lib64/libc.so.6
#2  0x00000000009a4184 in ExceptionalCondition (
    conditionName=conditionName@entry=0xb44e38 
    "LWLockHeldByMeInMode(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE)",
    fileName=fileName@entry=0xb45594 "bufmgr.c",
    lineNumber=lineNumber@entry=2216) at assert.c:66
#3  0x0000000000840da0 in MarkBufferDirty (buffer=buffer@entry=865) at bufmgr.c:2216
#4  0x000000000052f953 in ginPlaceToPage (btree=0x7fff6a3316b0, stack=0x1e5cbd0,
    insertdata=0x1e31ce8, updateblkno=626, childbuf=865, buildStats=0x0)
    at ginbtree.c:407
#5  0x000000000053084b in ginFinishSplit (btree=0x7fff6a3316b0, stack=0x1e5bac8,
    freestack=false, buildStats=0x0) at ginbtree.c:738
#6  0x000000000053103f in ginFindLeafPage (btree=btree@entry=0x7fff6a3316b0,
    searchMode=searchMode@entry=false, rootConflictCheck=rootConflictCheck@entry=false)
    at ginbtree.c:111
#7  0x000000000053b850 in ginEntryInsert (ginstate=ginstate@entry=0x1e1f708, attnum=1,
    key=31597248, category=0 '\000', items=0x1e222d8, nitem=1, buildStats=0x0)
    at gininsert.c:195
#8  0x00000000005373f2 in ginInsertCleanup (ginstate=ginstate@entry=0x1e1f708,
    full_clean=full_clean@entry=false, fill_fsm=fill_fsm@entry=true,
    forceCleanup=forceCleanup@entry=false, stats=stats@entry=0x0) at ginfast.c:930
#9  0x0000000000537f2a in ginHeapTupleFastInsert (ginstate=ginstate@entry=0x1e1f708,
    collector=collector@entry=0x7fff6a331ab0) at ginfast.c:471
```

In the function ginFindLeafPage, if we encounter a page marked with
GIN_INCOMPLETE_SPLIT, we call ginFinishSplit to finish the incomplete split and
remove the GIN_INCOMPLETE_SPLIT flag from that page. ginFinishSplit requires
that "On entry, stack->buffer is exclusively locked," as explained in its
comments.

The function ginFindLeafPage determines the type of lock held by ginTraverseLock:
leaf pages hold GIN_EXCLUSIVE, and internal page hold GIN_SHARE. If ginFindLeafPage
traverses to an internal page with an incomplete split, it will only hold a
GIN_SHARE lock, which does not meet the requirements of ginFinishSplit. If a
crash occurs when an internal page is split, but no downlink is inserted in the
parent page, this problem may occur.

I injected the following code into the ginbtree.c file to trigger a crash
during the splitting of an internal page:
```
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -621,6 +621,12 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
                 PageSetLSN(BufferGetPage(lbuffer), recptr);
             if (BufferIsValid(childbuf))
                 PageSetLSN(childpage, recptr);
+
+            if (stack->parent != NULL && !GinPageIsLeaf(newlpage))
+            {
+                XLogFlush(recptr);
+                elog(PANIC, "internal page split for block: %u", stack->blkno);
+            }
         }
         END_CRIT_SECTION();
```

With the modifications above, the problem can be reproduced with the following
steps:
```
alter system set autovacuum to off;
alter system set gin_pending_list_limit to 64;
select pg_reload_conf();
create table t (a text);
create extension btree_gin;
create index on t USING gin (a);
-- PANIC crash
insert into t select generate_series(1, 100000);

-- Assert fail
insert into t select 1;
```

I reviewed all places where ginFinishSplit is called, and only in two instances
in ginFindLeafPage might it be possible to not hold an exclusive lock on
stack->buffer.

The patch I provided in the attachment has been verified to fix the issue
mentioned above.
However, it may not be perfect: passing GIN_EXCLUSIVE to ginStepRight might
affect performance. Do we need to add a parameter to ginStepRight, and within
the function ginStepRight, elevate the lock level for an incomplete split page?

Looking forward to suggestions from developers!


Best Regards,
Fei Changhong
Alibaba Cloud Computing Ltd.
 
Attachment

Re: MarkBufferDirty Assert held LW_EXCLUSIVE lock fail when ginFinishSplit

From
Heikki Linnakangas
Date:
On 22/01/2024 09:59, feichanghong wrote:
> In the function ginFindLeafPage, if we encounter a page marked with
> GIN_INCOMPLETE_SPLIT, we call ginFinishSplit to finish the incomplete 
> split and
> remove the GIN_INCOMPLETE_SPLIT flag from that page. ginFinishSplit requires
> that "On entry, stack->buffer is exclusively locked," as explained in its
> comments.
> 
> The function ginFindLeafPage determines the type of lock held by 
> ginTraverseLock:
> leaf pages hold GIN_EXCLUSIVE, and internal page hold GIN_SHARE. If 
> ginFindLeafPage
> traverses to an internal page with an incomplete split, it will only hold a
> GIN_SHARE lock, which does not meet the requirements of ginFinishSplit. If a
> crash occurs when an internal page is split, but no downlink is inserted 
> in the
> parent page, this problem may occur.
> 
> I injected the following code into the ginbtree.c file to trigger a crash
> during the splitting of an internal page:
> ```
> --- a/src/backend/access/gin/ginbtree.c
> +++ b/src/backend/access/gin/ginbtree.c
> @@ -621,6 +621,12 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
>                   PageSetLSN(BufferGetPage(lbuffer), recptr);
>               if (BufferIsValid(childbuf))
>                   PageSetLSN(childpage, recptr);
> +
> +            if (stack->parent != NULL && !GinPageIsLeaf(newlpage))
> +            {
> +                XLogFlush(recptr);
> +                elog(PANIC, "internal page split for block: %u", 
> stack->blkno);
> +            }
>           }
>           END_CRIT_SECTION();
> ```
> 
> With the modifications above, the problem can be reproduced with the 
> following
> steps:
> ```
> alter system set autovacuum to off;
> alter system set gin_pending_list_limit to 64;
> select pg_reload_conf();
> create table t (a text);
> create extension btree_gin;
> create index on t USING gin (a);
> -- PANIC crash
> insert into t select generate_series(1, 100000);
> 
> -- Assert fail
> insert into t select 1;
> ```
> 
> I reviewed all places where ginFinishSplit is called, and only in two 
> instances
> in ginFindLeafPage might it be possible to not hold an exclusive lock on
> stack->buffer.
> 
> The patch I provided in the attachment has been verified to fix the issue
> mentioned above.
> However, it may not be perfect: passing GIN_EXCLUSIVE to ginStepRight might
> affect performance. Do we need to add a parameter to ginStepRight, and 
> within
> the function ginStepRight, elevate the lock level for an incomplete 
> split page?
> 
> Looking forward to suggestions from developers!

Thanks, I'll look into this. The fix seems fine at a quick glance, but 
I'll think about the performance aspect a bit more.

I'm thinking of adding tests for this using the new fault-injection 
facility that Michael just merged in commit d86d20f0ba. For GIN, but 
also B-tree which has a similar incomplete-split mechanism.

Another way to create a scenario with incomplete splits, which doesn't 
involve any crashes or errors, would be to perform PITR to just between 
the insert and the finish-split records. But the fault-injection seems 
easier.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Thank you for your attention.

Thanks, I'll look into this. The fix seems fine at a quick glance, but I'll think about the performance aspect a bit more.
I apologize for the mistake in my patch: "if GinPageIsIncompleteSplit(page)" is
missing a parenthesis, it should be "if (GinPageIsIncompleteSplit(page))"

Another way to create a scenario with incomplete splits, which doesn't involve any crashes or errors, would be to perform PITR to just between the insert and the finish-split records. But the fault-injection seems easier.
I agree it, fault-injection is easier than PITR.

Best Regards,
Fei Changhong
Alibaba Cloud Computing Ltd.

Thank you for your attention.

Thanks, I'll look into this. The fix seems fine at a quick glance, but I'll think about the performance aspect a bit more.
I apologize for the mistake in my patch: "if GinPageIsIncompleteSplit(page)" is
missing a parenthesis, it should be "if (GinPageIsIncompleteSplit(page))"

Another way to create a scenario with incomplete splits, which doesn't involve any crashes or errors, would be to perform PITR to just between the insert and the finish-split records. But the fault-injection seems easier.
I agree it, fault-injection is easier than PITR.

Best Regards,
Fei Changhong
Alibaba Cloud Computing Ltd.

Re: MarkBufferDirty Assert held LW_EXCLUSIVE lock fail when ginFinishSplit

From
Heikki Linnakangas
Date:
(Adding Michael for the fault-injection parts)

On 22/01/2024 17:27, feichanghong wrote:
> Thank you for your attention.
> 
>> Thanks, I'll look into this. The fix seems fine at a quick glance, but 
>> I'll think about the performance aspect a bit more.

 From a performance point of view, this doesn't matter. Incomplete split 
are extremely rare. For convenience, though, I added a new function 
specifically for handling these "leftover" incomplete splits as opposed 
to finishing a split that you just made, which performs the 
lock-upgrade. See attached. I think that helps with readability, and 
makes it less likely that we'll forget the lock-upgrade in the future if 
the insertion code is refactored.

> I apologize for the mistake in my patch: "if 
> GinPageIsIncompleteSplit(page)" is
> missing a parenthesis, it should be "if (GinPageIsIncompleteSplit(page))"
> 
>> Another way to create a scenario with incomplete splits, which doesn't 
>> involve any crashes or errors, would be to perform PITR to just 
>> between the insert and the finish-split records. But the 
>> fault-injection seems easier.
> I agree it, fault-injection is easier than PITR.

The attached patch contains a test case using the fault-injection facility.

Michael, it was a pleasure to write this test with the injection points, 
compared to trying to set up PITR at just the right moment. Thank you! 
Since this is the first test that uses it, I didn't have any precedence 
to copy-paste; can you take a look and verify if this is how you 
imagined the facility to be used?

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Attachment

From a performance point of view, this doesn't matter. Incomplete split are extremely rare. For convenience, though, I added a new function specifically for handling these "leftover" incomplete splits as opposed to finishing a split that you just made, which performs the lock-upgrade. See attached. I think that helps with readability, and makes it less likely that we'll forget the lock-upgrade in the future if the insertion code is refactored.
I think that the lock-upgrade in the ginFinishOldSplit function is unsafe
because it violates the requirement of the ginStepRight function that
"The next page is locked first, before releasing the current page.”


Best Regards,
Fei Changhong
Alibaba Cloud Computing Ltd.

Re: MarkBufferDirty Assert held LW_EXCLUSIVE lock fail when ginFinishSplit

From
Heikki Linnakangas
Date:
On 22/01/2024 18:21, feichanghong wrote:
> 
>> From a performance point of view, this doesn't matter. Incomplete 
>> split are extremely rare. For convenience, though, I added a new 
>> function specifically for handling these "leftover" incomplete splits 
>> as opposed to finishing a split that you just made, which performs the 
>> lock-upgrade. See attached. I think that helps with readability, and 
>> makes it less likely that we'll forget the lock-upgrade in the future 
>> if the insertion code is refactored.
> I think that the lock-upgrade in the ginFinishOldSplit function is unsafe
> because it violates the requirement of the ginStepRight function that
> "The next page is locked first, before releasing the current page.”

Good point.

I started to work on a more invasive patch that would move the 
ginFinishOldSplit() calls to ginTraverseLock() and ginStepRight(), doing 
the interlocking correctly. That makes life easier for the callers; they 
don't need to deal with incomplete-splits anymore.

But then I read the Page deletion section in the README and understood 
that my earlier patch is safe, after all. The lock-coupling in 
ginStepRight() is only needed for searches, not for inserts. There is 
another mechanism that prevents concurrent page deletions during an 
insert: VACUUM holds a cleanup-lock on the root page.

Does that make sense, or am I missing something? Here's the same patch 
as before, with some extra comments to explain why it's safe.

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Attachment

Re: MarkBufferDirty Assert held LW_EXCLUSIVE lock fail when ginFinishSplit

From
Michael Paquier
Date:
On Mon, Jan 22, 2024 at 10:47:17PM +0200, Heikki Linnakangas wrote:
> Michael, it was a pleasure to write this test with the injection points,
> compared to trying to set up PITR at just the right moment. Thank you! Since
> this is the first test that uses it, I didn't have any precedence to
> copy-paste; can you take a look and verify if this is how you imagined the
> facility to be used?

I was wondering how many weeks it would take for somebody to begin
playing with that, and here you are 12 hours later..

+EXTRA_INSTALL = src/test/modules/injection_points
[...]
+#ifdef USE_INJECTION_POINTS
+        if (GinPageIsLeaf(BufferGetPage(stack->buffer)))
+            INJECTION_POINT("gin-leave-leaf-split-incomplete");
+        else
+            INJECTION_POINT("gin-leave-internal-split-incomplete");
+#endif

Yes, that would be one way of doing it.  It makes the code a bit more
invasive but that would work.  Now, you expect this point to be
conditional depending on the state of the buffer.  One thing I was
thinking of was to extend the APIs in injection_point.c to be able to
handle a set of arguments as a set of (void*) so as callbacks could
internally decide what they want to do depending on the running state.
I didn't have a use for it, but well, you do, so perhaps we could just
bite the bullet and do it.

That reduces the footprint on the backend code, moving more logic
behind USE_INJECTION_POINTS in injection_point.c.  At the end, you
should be able to patch the core backend with something like that,
without using some ifdefs on USE_INJECTION_POINTS:
INJECTION_POINT_1ARG("gin-leave-leaf-split", (void *) buffer)

And INJECTION_POINT_1ARG() would be just a wrapper around something
like:
extern void InjectionPointRun1(const char *name, void *arg1);

And then plug in the callback you want.  Having the callback in a new
test module with the SQL test itself would be OK, or you could just
add it to src/test/modules/injection_points/ with a SQL test in it, or
just use EXTRA_INSTALL.
--
Michael

Attachment

Re: MarkBufferDirty Assert held LW_EXCLUSIVE lock fail when ginFinishSplit

From
Michael Paquier
Date:
On Tue, Jan 23, 2024 at 08:38:09AM +0900, Michael Paquier wrote:
> And INJECTION_POINT_1ARG() would be just a wrapper around something
> like:
> extern void InjectionPointRun1(const char *name, void *arg1);
>
> And then plug in the callback you want.  Having the callback in a new
> test module with the SQL test itself would be OK, or you could just
> add it to src/test/modules/injection_points/ with a SQL test in it, or
> just use EXTRA_INSTALL.

That would mean to extend the code like the attached, I think.  Would
that be useful for you?
--
Michael

Attachment

On Jan 23, 2024, at 04:47, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 22/01/2024 18:21, feichanghong wrote:
From a performance point of view, this doesn't matter. Incomplete split are extremely rare. For convenience, though, I added a new function specifically for handling these "leftover" incomplete splits as opposed to finishing a split that you just made, which performs the lock-upgrade. See attached. I think that helps with readability, and makes it less likely that we'll forget the lock-upgrade in the future if the insertion code is refactored.
I think that the lock-upgrade in the ginFinishOldSplit function is unsafe
because it violates the requirement of the ginStepRight function that
"The next page is locked first, before releasing the current page.”

Good point.

I started to work on a more invasive patch that would move the ginFinishOldSplit() calls to ginTraverseLock() and ginStepRight(), doing the interlocking correctly. That makes life easier for the callers; they don't need to deal with incomplete-splits anymore.

But then I read the Page deletion section in the README and understood that my earlier patch is safe, after all. The lock-coupling in ginStepRight() is only needed for searches, not for inserts. There is another mechanism that prevents concurrent page deletions during an insert: VACUUM holds a cleanup-lock on the root page.

Does that make sense, or am I missing something? Here's the same patch as before, with some extra comments to explain why it's safe.

From my understanding, you are right. The README includes the following
explanation:
ginScanToDelete() traverses the whole tree in depth-first manner. It starts
from the full cleanup lock on the tree root.  This lock prevents all the
concurrent insertions into this tree while we're deleting pages. However,
there are still might be some in-progress readers, who traversed root before
we locked it.

For insert, ginFindLeafPage maintains a pin on the relevant pages along the
path, naturally retaining the pin on the root page as well. Before calling
ginScanToDelete to delete a page, it secures an exclusive lock on the root page
and no other backend holds a pin on the root page through LockBufferForCleanup.

Best Regards,
Fei Changhong
Alibaba Cloud Computing Ltd.

Re: MarkBufferDirty Assert held LW_EXCLUSIVE lock fail when ginFinishSplit

From
Heikki Linnakangas
Date:
On 23/01/2024 01:38, Michael Paquier wrote:
> On Mon, Jan 22, 2024 at 10:47:17PM +0200, Heikki Linnakangas wrote:
>> Michael, it was a pleasure to write this test with the injection points,
>> compared to trying to set up PITR at just the right moment. Thank you! Since
>> this is the first test that uses it, I didn't have any precedence to
>> copy-paste; can you take a look and verify if this is how you imagined the
>> facility to be used?
> 
> I was wondering how many weeks it would take for somebody to begin
> playing with that, and here you are 12 hours later..

:-D

> +EXTRA_INSTALL = src/test/modules/injection_points
> [...]
> +#ifdef USE_INJECTION_POINTS
> +        if (GinPageIsLeaf(BufferGetPage(stack->buffer)))
> +            INJECTION_POINT("gin-leave-leaf-split-incomplete");
> +        else
> +            INJECTION_POINT("gin-leave-internal-split-incomplete");
> +#endif
> 
> Yes, that would be one way of doing it.  It makes the code a bit more
> invasive but that would work.  Now, you expect this point to be
> conditional depending on the state of the buffer.  One thing I was
> thinking of was to extend the APIs in injection_point.c to be able to
> handle a set of arguments as a set of (void*) so as callbacks could
> internally decide what they want to do depending on the running state.
> I didn't have a use for it, but well, you do, so perhaps we could just
> bite the bullet and do it.
> 
> That reduces the footprint on the backend code, moving more logic
> behind USE_INJECTION_POINTS in injection_point.c.  At the end, you
> should be able to patch the core backend with something like that,
> without using some ifdefs on USE_INJECTION_POINTS:
> INJECTION_POINT_1ARG("gin-leave-leaf-split", (void *) buffer)
> 
> And INJECTION_POINT_1ARG() would be just a wrapper around something
> like:
> extern void InjectionPointRun1(const char *name, void *arg1);

I can see that being useful in general. It requires a bespoken callback 
function, though. One nice thing about this test was that I didn't need 
to write a new C module, only that snippet at the injection point and 
some SQL. In more complicated tests a new C module is warranted, but I 
think a lot of tests - like this gin test - can be written without it.

Perhaps if the argument was uint64 instead of a pointer, so that the 
'injection_point' module could contain ready-made functions that do 
things with the argument.

Overall I feel we should not bother with injection point arguments for 
now, though. Let's wait until we have a few more tests that would 
benefit from that, and then we'll have more information on what the 
ideal interface would look like.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: MarkBufferDirty Assert held LW_EXCLUSIVE lock fail when ginFinishSplit

From
Michael Paquier
Date:
On Tue, Jan 23, 2024 at 09:12:25AM +0200, Heikki Linnakangas wrote:
> I can see that being useful in general. It requires a bespoken callback
> function, though. One nice thing about this test was that I didn't need to
> write a new C module, only that snippet at the injection point and some SQL.
> In more complicated tests a new C module is warranted, but I think a lot of
> tests - like this gin test - can be written without it.

FWIW, I don't mind if src/test/modules/injection_points/ is used as a
single point across the tree as a library storing a large set of
callbacks.  All these don't have to be generic, and I think that we
should remain a maximum flexible for bug fixing and advanced testing.
I'm OK to leave the call up to the committer because it comes down to
invasiveness in src/backend/ vs invasiveness in the number of
callbacks or the logic inside the callbacks (sometimes it makes sense
to me to also complicate more the backend), and I'm OK with what you
are doing here for this thread.

> Perhaps if the argument was uint64 instead of a pointer, so that the
> 'injection_point' module could contain ready-made functions that do things
> with the argument.

I am not exactly sure to understand what you mean here.

> Overall I feel we should not bother with injection point arguments for now,
> though. Let's wait until we have a few more tests that would benefit from
> that, and then we'll have more information on what the ideal interface would
> look like.

Yeah, I am not sure if that's worth having for now, but that's always
good to keep in mind that the infra can be extended to handle such
cases even for stable branches.
--
Michael

Attachment

Re: MarkBufferDirty Assert held LW_EXCLUSIVE lock fail when ginFinishSplit

From
Heikki Linnakangas
Date:
On 23/01/2024 05:39, feichanghong wrote:
> 
>> On Jan 23, 2024, at 04:47, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>
>> On 22/01/2024 18:21, feichanghong wrote:
>>>> From a performance point of view, this doesn't matter. Incomplete 
>>>> split are extremely rare. For convenience, though, I added a new 
>>>> function specifically for handling these "leftover" incomplete 
>>>> splits as opposed to finishing a split that you just made, which 
>>>> performs the lock-upgrade. See attached. I think that helps with 
>>>> readability, and makes it less likely that we'll forget the 
>>>> lock-upgrade in the future if the insertion code is refactored.
>>> I think that the lock-upgrade in the ginFinishOldSplit function is unsafe
>>> because it violates the requirement of the ginStepRight function that
>>> "The next page is locked first, before releasing the current page.”
>>
>> Good point.
>>
>> I started to work on a more invasive patch that would move the 
>> ginFinishOldSplit() calls to ginTraverseLock() and ginStepRight(), 
>> doing the interlocking correctly. That makes life easier for the 
>> callers; they don't need to deal with incomplete-splits anymore.
>>
>> But then I read the Page deletion section in the README and understood 
>> that my earlier patch is safe, after all. The lock-coupling in 
>> ginStepRight() is only needed for searches, not for inserts. There is 
>> another mechanism that prevents concurrent page deletions during an 
>> insert: VACUUM holds a cleanup-lock on the root page.
>>
>> Does that make sense, or am I missing something? Here's the same patch 
>> as before, with some extra comments to explain why it's safe.
> 
>  From my understanding, you are right. The README includes the following
> explanation:
> ginScanToDelete() traverses the whole tree in depth-first manner. It starts
> from the full cleanup lock on the tree root.  This lock prevents all the
> concurrent insertions into this tree while we're deleting pages. However,
> there are still might be some in-progress readers, who traversed root before
> we locked it.
> 
> For insert, ginFindLeafPage maintains a pin on the relevant pages along the
> path, naturally retaining the pin on the root page as well. Before calling
> ginScanToDelete to delete a page, it secures an exclusive lock on the 
> root page
> and no other backend holds a pin on the root page through 
> LockBufferForCleanup.

I have pushed this fix. Thanks for the report and the review!

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Dear hlinnaka,

I once attempted to construct a scenario with multiple backends concurrently
performing the split that leads to gin index corruption. Unfortunately, I did
not succeed. The reason is that ginInsertCleanup requires an Exclusive Lock
on METAPAGE, preventing multiple processes from simultaneously executing the
cleanup. 

Maybe I missed something. hlinnaka, do you have any suggestions?

Best Regards,
Fei Changhong
Alibaba Cloud Computing Ltd.