Thread: Using Valgrind to detect faulty buffer accesses (no pin or buffercontent lock held)
Using Valgrind to detect faulty buffer accesses (no pin or buffercontent lock held)
From
Peter Geoghegan
Date:
I recently expressed an interest in using Valgrind memcheck to detect access to pages whose buffers do not have a pin held in the backend, or do not have a buffer lock held (the latter check makes sense for pages owned by index access methods). I came up with a quick and dirty patch, that I confirmed found a bug in nbtree VACUUM that I spotted randomly: https://postgr.es/m/CAH2-Wz=WRu6NMWtit2weDnuGxdsWeNyFygeBP_zZ2Sso0YAGFg@mail.gmail.com (This is a bug in commit 857f9c36cda.) Alvaro wrote a similar patch back in 2015, that I'd forgotten about but was reminded of today: https://postgr.es/m/20150723195349.GW5596@postgresql.org I took his version (which was better than my rough original) and rebased it -- that's attached as the first patch. The second patch is something that takes the general idea further by having nbtree mark pages whose buffers lack a buffer lock (that may or may not have a buffer pin) as NOACCESS in a similar way. This second patch detected two more bugs in nbtree page deletion by running the regression tests with Valgrind memcheck. These additional bugs are probably of lower severity than the first one, since we at least have a buffer pin (we just don't have buffer locks). All three bugs are very similar, though: they all involve dereferencing a pointer to the special area of a page at a point where the underlying buffer is no longer safe to access. The final two patches fix the two newly discovered bugs -- I don't have a fix for the first bug yet, since that one is more complicated (and probably more serious). The regression tests run with Valgrind will complain about all three bugs if you just apply the first two patches (though you only need the first patch to see a complaint about the first, more serious bug when the tests are run). -- Peter Geoghegan
Attachment
Re: Using Valgrind to detect faulty buffer accesses (no pin or buffercontent lock held)
From
Peter Geoghegan
Date:
On Fri, Apr 24, 2020 at 6:37 PM Peter Geoghegan <pg@bowt.ie> wrote: > The final two patches fix the two newly discovered bugs -- I don't > have a fix for the first bug yet, since that one is more complicated > (and probably more serious). I pushed both of the two fixes that I posted yesterday -- fixes for the benign "no buffer lock held" issues in nbtree page deletion. We still need a fix for the more serious issue, though. And that will need to be backpatched down to v11. I'll try to get to that next week. Attached is v2 of the patch series. This version has a centralized description of what we require from nbtree code above _bt_getbuf(), alongside existing, similar commentary. Here's the general idea: If we lock a buffer, that has to go through one of the wrapper functions that knows about Valgrind in all cases. It's presented as a stricter version of what happens in bufmgr.c for all buffers from all access methods. I also added something about how the nbtree Valgrind client requests (those added by the second patch in the series) assume that the bufmgr.c client requests (from the first patch) also take place. We need to be able to rely on bufmgr.c to "clean up" in the event of an error, so the second patch has a strict dependency on the first. If the transaction aborts, we can rely on bufmgr.c marking the buffer's page as defined when the backend acquires its first buffer pin on the buffer in the next transaction (doesn't matter whether or not the same block/page is in the same buffer as before). This is why we can't use client requests for local buffers (not that we'd want to; the existing aset.c Valgrind client requests take care of them automatically). -- Peter Geoghegan
Attachment
Re: Using Valgrind to detect faulty buffer accesses (no pin or buffer content lock held)
From
Daniel Gustafsson
Date:
> On 26 Apr 2020, at 02:17, Peter Geoghegan <pg@bowt.ie> wrote: > Attached is v2 of the patch series. This patch fails to apply to HEAD due to conflicts in nbtpage.c, can you please submit a rebased version? cheers ./daniel
Re: Using Valgrind to detect faulty buffer accesses (no pin or buffer content lock held)
From
Peter Geoghegan
Date:
On Thu, Jul 2, 2020 at 7:48 AM Daniel Gustafsson <daniel@yesql.se> wrote: > This patch fails to apply to HEAD due to conflicts in nbtpage.c, can you please > submit a rebased version? I attach the rebased patch series. Thanks -- Peter Geoghegan
Attachment
Re: Using Valgrind to detect faulty buffer accesses (no pin or buffer content lock held)
From
Georgios Kokolatos
Date:
As a general overview, the series of patches in the mail thread do match their description. The addition of the stricter,explicit use of instrumentation does improve the design as the distinction of the use cases requiring a pin or alock is made more clear. The added commentary is descriptive and appears grammatically correct, at least to a non nativespeaker. Unfortunately though, the two bug fixes do not seem to apply. Also, there is a small issue regarding the process, not the content of the patches. In CF app there is a latest attachment (v3-0002-Add-nbtree-Valgrind-buffer-lock-checks.patch) which does not appear in the mail thread. Before changingthe status, I will kindly ask for the complete latest series that applies in the mail thread. The new status of this patch is: Waiting on Author
Re: Using Valgrind to detect faulty buffer accesses (no pin or buffer content lock held)
From
Anastasia Lubennikova
Date:
On 02.07.2020 20:11, Peter Geoghegan wrote: > On Thu, Jul 2, 2020 at 7:48 AM Daniel Gustafsson <daniel@yesql.se> wrote: >> This patch fails to apply to HEAD due to conflicts in nbtpage.c, can you please >> submit a rebased version? > I attach the rebased patch series. > > Thanks It's impressive that this check helped to find several bugs. I only noticed small inconsistency in the new comment for _bt_conditionallockbuf(). It says "Note: Caller is responsible for calling _bt_checkpage() on success.", while in _bt_getbuf() the call is not followed by _bt_checkpage(). Moreover, _bt_page_recyclable() contradicts _bt_checkpage() checks. Other than that, patches look good to me, so move them to "Ready For Committer". Are you planning to add same checks for other access methods? -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Using Valgrind to detect faulty buffer accesses (no pin or buffer content lock held)
From
Peter Geoghegan
Date:
On Thu, Jul 16, 2020 at 10:24 AM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > It's impressive that this check helped to find several bugs. While it's definitely true that it *could* have detected the bug fixed by commit b0229f26, it's kind of debatable whether or not the bugs I fixed in commit fa7ff642 and commit 7154aa16 (which actually were found using this new instrumentation) were truly bugs. The behavior in question was probably safe, since only the special/opaque page area was accessed -- and with at least a buffer pin held. But it's not worth having a debate about whether or not it should be considered safe. There is no downside to not having a simple strict rule that's easy to enforce. Also, I myself spotted some bugs in the skip scan patch series at one point that would also be caught by the new instrumentation. > I only noticed small inconsistency in the new comment for > _bt_conditionallockbuf(). > > It says "Note: Caller is responsible for calling _bt_checkpage() on > success.", while in _bt_getbuf() the call is not followed by > _bt_checkpage(). > Moreover, _bt_page_recyclable() contradicts _bt_checkpage() checks. Nice catch. > Other than that, patches look good to me, so move them to "Ready For > Committer". Pushed the first patch just now, and intend to push the other one soon. Thanks! > Are you planning to add same checks for other access methods? Not at the moment, but it does seem like my approach could be generalized to other index access methods. -- Peter Geoghegan
Re: Using Valgrind to detect faulty buffer accesses (no pin or buffer content lock held)
From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes: > On Thu, Jul 16, 2020 at 10:24 AM Anastasia Lubennikova > <a.lubennikova@postgrespro.ru> wrote: >> Other than that, patches look good to me, so move them to "Ready For >> Committer". > Pushed the first patch just now, and intend to push the other one soon. Thanks! I wonder whether skink's failure today is due to this change: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2020-07-18%2018%3A01%3A10 regards, tom lane
Re: Using Valgrind to detect faulty buffer accesses (no pin or buffer content lock held)
From
Peter Geoghegan
Date:
On Sat, Jul 18, 2020 at 7:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I wonder whether skink's failure today is due to this change: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2020-07-18%2018%3A01%3A10 That seems extremely likely. I think that I need to do something like what you see in the attached. Anyway, I'll take care of it tomorrow. Sorry for missing it before my commit. -- Peter Geoghegan
Attachment
Re: Using Valgrind to detect faulty buffer accesses (no pin or buffer content lock held)
From
Peter Geoghegan
Date:
On Mon, Jul 6, 2020 at 1:35 AM Georgios Kokolatos <gkokolatos@protonmail.com> wrote: > As a general overview, the series of patches in the mail thread do match their description. The addition of the stricter,explicit use of instrumentation does improve the design as the distinction of the use cases requiring a pin or alock is made more clear. The added commentary is descriptive and appears grammatically correct, at least to a non nativespeaker. I didn't see this review until now because it ended up in gmail's spam folder. :-( Thanks for taking a look at it! -- Peter Geoghegan
Re: Using Valgrind to detect faulty buffer accesses (no pin or buffer content lock held)
From
Peter Geoghegan
Date:
On Fri, Jul 17, 2020 at 5:53 PM Peter Geoghegan <pg@bowt.ie> wrote: > Pushed the first patch just now, and intend to push the other one soon. Thanks! Pushed the second piece of this (the nbtree patch) just now. Thanks for the review! -- Peter Geoghegan
Re: Using Valgrind to detect faulty buffer accesses (no pin or buffer content lock held)
From
Georgios
Date:
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐ On Tuesday, July 21, 2020 11:52 PM, Peter Geoghegan <pg@bowt.ie> wrote: > On Mon, Jul 6, 2020 at 1:35 AM Georgios Kokolatos > gkokolatos@protonmail.com wrote: > > > As a general overview, the series of patches in the mail thread do match their description. The addition of the stricter,explicit use of instrumentation does improve the design as the distinction of the use cases requiring a pin or alock is made more clear. The added commentary is descriptive and appears grammatically correct, at least to a non nativespeaker. > > I didn't see this review until now because it ended up in gmail's spam > folder. :-( > > Thanks for taking a look at it! No worries at all. It happens and it was beneficial for me to read the patch. //Georgios > > ---------------------------------------------------------------------------------------------------------------------- > > Peter Geoghegan