Thread: Using Valgrind to detect faulty buffer accesses (no pin or buffercontent lock held)

I recently expressed an interest in using Valgrind memcheck to detect
access to pages whose buffers do not have a pin held in the backend,
or do not have a buffer lock held (the latter check makes sense for
pages owned by index access methods). I came up with a quick and dirty
patch, that I confirmed found a bug in nbtree VACUUM that I spotted
randomly:

https://postgr.es/m/CAH2-Wz=WRu6NMWtit2weDnuGxdsWeNyFygeBP_zZ2Sso0YAGFg@mail.gmail.com

(This is a bug in commit 857f9c36cda.)

Alvaro wrote a similar patch back in 2015, that I'd forgotten about
but was reminded of today:

https://postgr.es/m/20150723195349.GW5596@postgresql.org

I took his version (which was better than my rough original) and
rebased it -- that's attached as the first patch. The second patch is
something that takes the general idea further by having nbtree mark
pages whose buffers lack a buffer lock (that may or may not have a
buffer pin) as NOACCESS in a similar way.

This second patch detected two more bugs in nbtree page deletion by
running the regression tests with Valgrind memcheck. These additional
bugs are probably of lower severity than the first one, since we at
least have a buffer pin (we just don't have buffer locks). All three
bugs are very similar, though: they all involve dereferencing a
pointer to the special area of a page at a point where the underlying
buffer is no longer safe to access.

The final two patches fix the two newly discovered bugs -- I don't
have a fix for the first bug yet, since that one is more complicated
(and probably more serious). The regression tests run with Valgrind
will complain about all three bugs if you just apply the first two
patches (though you only need the first patch to see a complaint about
the first, more serious bug when the tests are run).

-- 
Peter Geoghegan

Attachment
On Fri, Apr 24, 2020 at 6:37 PM Peter Geoghegan <pg@bowt.ie> wrote:
> The final two patches fix the two newly discovered bugs -- I don't
> have a fix for the first bug yet, since that one is more complicated
> (and probably more serious).

I pushed both of the two fixes that I posted yesterday -- fixes for
the benign "no buffer lock held" issues in nbtree page deletion. We
still need a fix for the more serious issue, though. And that will
need to be backpatched down to v11. I'll try to get to that next week.

Attached is v2 of the patch series. This version has a centralized
description of what we require from nbtree code above _bt_getbuf(),
alongside existing, similar commentary. Here's the general idea: If we
lock a buffer, that has to go through one of the wrapper functions
that knows about Valgrind in all cases. It's presented as a stricter
version of what happens in bufmgr.c for all buffers from all access
methods.

I also added something about how the nbtree Valgrind client requests
(those added by the second patch in the series) assume that the
bufmgr.c client requests (from the first patch) also take place. We
need to be able to rely on bufmgr.c to "clean up" in the event of an
error, so the second patch has a strict dependency on the first. If
the transaction aborts, we can rely on bufmgr.c marking the buffer's
page as defined when the backend acquires its first buffer pin on the
buffer in the next transaction (doesn't matter whether or not the same
block/page is in the same buffer as before). This is why we can't use
client requests for local buffers (not that we'd want to; the existing
aset.c Valgrind client requests take care of them automatically).

-- 
Peter Geoghegan

Attachment
> On 26 Apr 2020, at 02:17, Peter Geoghegan <pg@bowt.ie> wrote:

> Attached is v2 of the patch series.

This patch fails to apply to HEAD due to conflicts in nbtpage.c, can you please
submit a rebased version?

cheers ./daniel


On Thu, Jul 2, 2020 at 7:48 AM Daniel Gustafsson <daniel@yesql.se> wrote:
> This patch fails to apply to HEAD due to conflicts in nbtpage.c, can you please
> submit a rebased version?

I attach the rebased patch series.

Thanks
-- 
Peter Geoghegan

Attachment
As a general overview, the series of patches in the mail thread do match their description. The addition of the
stricter,explicit use of instrumentation does improve the design as the distinction of the use cases requiring a pin or
alock is made more clear. The added commentary is descriptive and appears grammatically correct, at least to a non
nativespeaker.
 

Unfortunately though, the two bug fixes do not seem to apply.

Also, there is a small issue regarding the process, not the content of the patches. In CF app there is a latest
attachment (v3-0002-Add-nbtree-Valgrind-buffer-lock-checks.patch)  which does not appear in the mail thread.  Before
changingthe status, I will kindly ask for the complete latest series that applies in the mail thread. 

The new status of this patch is: Waiting on Author

Re: Using Valgrind to detect faulty buffer accesses (no pin or buffer content lock held)

From
Anastasia Lubennikova
Date:
On 02.07.2020 20:11, Peter Geoghegan wrote:
> On Thu, Jul 2, 2020 at 7:48 AM Daniel Gustafsson <daniel@yesql.se> wrote:
>> This patch fails to apply to HEAD due to conflicts in nbtpage.c, can you please
>> submit a rebased version?
> I attach the rebased patch series.
>
> Thanks

It's impressive that this check helped to find several bugs.

I only noticed small inconsistency in the new comment for 
_bt_conditionallockbuf().

It says "Note: Caller is responsible for calling _bt_checkpage() on 
success.", while in _bt_getbuf() the call is not followed by 
_bt_checkpage().
Moreover, _bt_page_recyclable() contradicts _bt_checkpage() checks.

Other than that, patches look good to me, so move them to "Ready For 
Committer".

Are you planning to add same checks for other access methods?

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




On Thu, Jul 16, 2020 at 10:24 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> It's impressive that this check helped to find several bugs.

While it's definitely true that it *could* have detected the bug fixed
by commit b0229f26, it's kind of debatable whether or not the bugs I
fixed in commit fa7ff642 and commit 7154aa16 (which actually were
found using this new instrumentation) were truly bugs.

The behavior in question was probably safe, since only the
special/opaque page area was accessed -- and with at least a buffer
pin held. But it's not worth having a debate about whether or not it
should be considered safe. There is no downside to not having a simple
strict rule that's easy to enforce. Also, I myself spotted some bugs
in the skip scan patch series at one point that would also be caught
by the new instrumentation.

> I only noticed small inconsistency in the new comment for
> _bt_conditionallockbuf().
>
> It says "Note: Caller is responsible for calling _bt_checkpage() on
> success.", while in _bt_getbuf() the call is not followed by
> _bt_checkpage().
> Moreover, _bt_page_recyclable() contradicts _bt_checkpage() checks.

Nice catch.

> Other than that, patches look good to me, so move them to "Ready For
> Committer".

Pushed the first patch just now, and intend to push the other one soon. Thanks!

> Are you planning to add same checks for other access methods?

Not at the moment, but it does seem like my approach could be
generalized to other index access methods.

--
Peter Geoghegan



Peter Geoghegan <pg@bowt.ie> writes:
> On Thu, Jul 16, 2020 at 10:24 AM Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
>> Other than that, patches look good to me, so move them to "Ready For
>> Committer".

> Pushed the first patch just now, and intend to push the other one soon. Thanks!

I wonder whether skink's failure today is due to this change:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2020-07-18%2018%3A01%3A10

            regards, tom lane



On Sat, Jul 18, 2020 at 7:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wonder whether skink's failure today is due to this change:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2020-07-18%2018%3A01%3A10

That seems extremely likely. I think that I need to do something like
what you see in the attached.

Anyway, I'll take care of it tomorrow. Sorry for missing it before my commit.

-- 
Peter Geoghegan

Attachment
On Mon, Jul 6, 2020 at 1:35 AM Georgios Kokolatos
<gkokolatos@protonmail.com> wrote:
> As a general overview, the series of patches in the mail thread do match their description. The addition of the
stricter,explicit use of instrumentation does improve the design as the distinction of the use cases requiring a pin or
alock is made more clear. The added commentary is descriptive and appears grammatically correct, at least to a non
nativespeaker. 

I didn't see this review until now because it ended up in gmail's spam
folder. :-(

Thanks for taking a look at it!

--
Peter Geoghegan



On Fri, Jul 17, 2020 at 5:53 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Pushed the first patch just now, and intend to push the other one soon. Thanks!

Pushed the second piece of this (the nbtree patch) just now.

Thanks for the review!
-- 
Peter Geoghegan




‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Tuesday, July 21, 2020 11:52 PM, Peter Geoghegan <pg@bowt.ie> wrote:

> On Mon, Jul 6, 2020 at 1:35 AM Georgios Kokolatos
> gkokolatos@protonmail.com wrote:
>
> > As a general overview, the series of patches in the mail thread do match their description. The addition of the
stricter,explicit use of instrumentation does improve the design as the distinction of the use cases requiring a pin or
alock is made more clear. The added commentary is descriptive and appears grammatically correct, at least to a non
nativespeaker. 
>
> I didn't see this review until now because it ended up in gmail's spam
> folder. :-(
>
> Thanks for taking a look at it!

No worries at all. It happens and it was beneficial for me to read the patch.

//Georgios
>
>
----------------------------------------------------------------------------------------------------------------------
>
> Peter Geoghegan