Thread: Bug in bttext_abbrev_convert()

Bug in bttext_abbrev_convert()

From
Peter Geoghegan
Date:
Commits b181a919 and arguably c79b6413 added bugs to
bttext_abbrev_convert() in the process of fixing some others. In the
master branch, bttext_abbrev_convert() can leak memory when the C
locale happens to be used and we must detoast, which is unacceptable
for about the same reason that it's unacceptable for a standard B-Tree
comparator routine. There could also be a use-after-free issue for
large strings that are detoasted, because bttext_abbrev_convert()
hashes memory which might already be freed (when hashing the
authoritative value).

Attached patch fixes these issues.

As we all know, the state of automated testing is pretty lamentable.
This is the kind of thing that we could catch more easily in the
future if better infrastructure were in place. I caught this by
eyeballing bttext_abbrev_convert() with slightly fresher eyes than the
last time I looked.
--
Peter Geoghegan

Attachment

Re: Bug in bttext_abbrev_convert()

From
Robert Haas
Date:
On Mon, Jun 29, 2015 at 7:47 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Commits b181a919 and arguably c79b6413 added bugs to
> bttext_abbrev_convert() in the process of fixing some others. In the
> master branch, bttext_abbrev_convert() can leak memory when the C
> locale happens to be used and we must detoast, which is unacceptable
> for about the same reason that it's unacceptable for a standard B-Tree
> comparator routine. There could also be a use-after-free issue for
> large strings that are detoasted, because bttext_abbrev_convert()
> hashes memory which might already be freed (when hashing the
> authoritative value).
>
> Attached patch fixes these issues.
>
> As we all know, the state of automated testing is pretty lamentable.
> This is the kind of thing that we could catch more easily in the
> future if better infrastructure were in place. I caught this by
> eyeballing bttext_abbrev_convert() with slightly fresher eyes than the
> last time I looked.

Committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Bug in bttext_abbrev_convert()

From
Jim Nasby
Date:
On 6/29/15 6:47 PM, Peter Geoghegan wrote:
> As we all know, the state of automated testing is pretty lamentable.
> This is the kind of thing that we could catch more easily in the
> future if better infrastructure were in place. I caught this by
> eyeballing bttext_abbrev_convert() with slightly fresher eyes than the
> last time I looked.

Isn't this the kind of thing Coverty's supposed to find?
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Bug in bttext_abbrev_convert()

From
Peter Geoghegan
Date:
On Tue, Jun 30, 2015 at 5:25 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> Isn't this the kind of thing Coverty's supposed to find?

I don't know, but in general I'm not very excited about static
analysis tools. The best things that they have going for them is that
they're available, and don't require test coverage in the same way as
running the regression tests with Valgrind enabled.

There is no real testing of sorting in the regression tests. It would
be nice to have a way of generating a large and varied selection of
sort operations programmatically, to catch this kind of thing.
pg_regress is not really up to it. The same applies to various other
cases where having a lot of "expected" output makes using pg_regress
infeasible.

-- 
Peter Geoghegan



Re: Bug in bttext_abbrev_convert()

From
Michael Paquier
Date:
On Wed, Jul 1, 2015 at 9:36 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Tue, Jun 30, 2015 at 5:25 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>> Isn't this the kind of thing Coverty's supposed to find?
>
> I don't know, but in general I'm not very excited about static
> analysis tools. The best things that they have going for them is that
> they're available, and don't require test coverage in the same way as
> running the regression tests with Valgrind enabled.

Well, yes. It should have complained about that if it were a perfect
tool able to emulate all code paths with all possible variable values.
But Coverity is not actually that perfect, and sometimes it misses the
shot, like here. by experience when you look at reports of a static
tool, you need to have a look first at other code paths that share
similarities with the problem reported, and you will actually see that
most of the time what the static tool saw is just the tip of the
iceberg. The human factor is determinant in this case.

> There is no real testing of sorting in the regression tests. It would
> be nice to have a way of generating a large and varied selection of
> sort operations programmatically, to catch this kind of thing.
> pg_regress is not really up to it. The same applies to various other
> cases where having a lot of "expected" output makes using pg_regress
> infeasible.

Well, the issue is double here:
1) Having a buildfarm member with valgrind would greatly help.
2) This code path is not used at all AFAIK in the test suite, so we
definitely lack regression tests here. In your opinion what would be a
sort set large enough to be able to trigger this code path? The idea
is to not make the regression test suite take too much time because of
it, and not to have the server created by pg_regress running the
regression tests having a too large PGDATA folder. For example, could
a btree index do it with a correct data set do it on at least one
platform?
Regards,
-- 
Michael



Re: Bug in bttext_abbrev_convert()

From
Peter Geoghegan
Date:
On Tue, Jun 30, 2015 at 9:39 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
>> There is no real testing of sorting in the regression tests. It would
>> be nice to have a way of generating a large and varied selection of
>> sort operations programmatically, to catch this kind of thing.
>> pg_regress is not really up to it. The same applies to various other
>> cases where having a lot of "expected" output makes using pg_regress
>> infeasible.
>
> Well, the issue is double here:
> 1) Having a buildfarm member with valgrind would greatly help.

Not sure that I agree with that. It wouldn't hurt, but people are
running Valgrind often enough these days that I think it's unlikely
that having it run consistently actually adds all that much. Maybe it
would make a difference to do it on a 32-bit platform, since most of
us have not used a 32-bit machine as our primary development
environment in about a decade, and we probably miss some things
because of that.

If you run Valgrind on Postgres master these days, you actually have a
pretty good chance of not finding any issues, which is great.
shared_buffers support for Valgrind would also help a lot (e.g. making
it so that referencing a shared buffer that isn't pinned causes a
Valgrind error -- Noah and I discussed this a while back; should be
doable). If you're looking for a new project, may I highly recommend
working on that.  :-)

> 2) This code path is not used at all AFAIK in the test suite, so we
> definitely lack regression tests here. In your opinion what would be a
> sort set large enough to be able to trigger this code path? The idea
> is to not make the regression test suite take too much time because of
> it, and not to have the server created by pg_regress running the
> regression tests having a too large PGDATA folder. For example, could
> a btree index do it with a correct data set do it on at least one
> platform?

Maybe. The trick would be constructing a case where many different
code paths are covered, including the many different permutations and
combinations of how an external sort can go (number of runs, merge
steps, datatypes, etc).

In general, I think that there is a lot of value to be added by
someone making it their explicit goal to increase test coverage, as
measured by a tool like gcov (plus subjective expert analysis, of
course), particularly when it comes to things like memory corruption
bugs (hopefully including shared memory corruption bugs, and hopefully
including recovery). If someone was doing that in a focused way, then
the codepath where we must explicitly pfree() (in the case of this
bug) would probably have coverage, and then Valgrind probably would
catch this.

As long as that has to be a part of adding things to the standard
regression test suite (particularly with sorts), a suite which is
expected to run quickly, and as long as we're entirely relying on
pg_regress, we will not make progress here. Maybe if there was an
extended regression test suite that was explicitly about meeting a
code coverage goal we'd do better. Yes, I think mechanically
increasing code coverage is useful as an end in itself (although I do
accept that meeting a particularly coverage percentage is often not
especially useful). Increasing coverage has led me to the right
question before, just as static analysis has done that for you. For
example, the effort of increasing coverage can find dead code, and you
can intuitively get a sense of where to look for bugs manually by
determining mechanically what code paths are hard to hit, or are
naturally seldom hit.

It would be nice to always have a html report from gcov always
available on the internet. That would be something useful to automate,
IMV. Watching that regress over time might provide useful insight, but
I only use gcov a couple of times a year, so that's not going to
happen on its own.
-- 
Peter Geoghegan



Re: Bug in bttext_abbrev_convert()

From
Peter Geoghegan
Date:
On Tue, Jun 30, 2015 at 10:25 PM, Peter Geoghegan <pg@heroku.com> wrote:
> It would be nice to always have a html report from gcov always
> available on the internet. That would be something useful to automate,
> IMV. Watching that regress over time might provide useful insight, but
> I only use gcov a couple of times a year, so that's not going to
> happen on its own.

I generated such a report just now, and noticed this:
  1137 :     tuplesort_performsort(btspool->sortstate);    214        1131 :     if (btspool2)    215           0 :
   tuplesort_performsort(btspool2->sortstate);    216             :    217        1131 :     wstate.heap =
btspool->heap;   218        1131 :     wstate.index = btspool->index;
 

The regression tests have zero coverage for this
tuplesort_performsort() "btspool2" case. That's a fairly common case
to have no coverage for, and that took me all of 5 minutes to find.
-- 
Peter Geoghegan



Re: Bug in bttext_abbrev_convert()

From
Peter Geoghegan
Date:
On Tue, Jun 30, 2015 at 10:35 PM, Peter Geoghegan <pg@heroku.com> wrote:
> The regression tests have zero coverage for this
> tuplesort_performsort() "btspool2" case. That's a fairly common case
> to have no coverage for, and that took me all of 5 minutes to find.

BTW, I looked here because I added a bunch of sortsupport stuff to
_bt_load() in 9.5. It appears that that new code is entirely without
coverage.

-- 
Peter Geoghegan



Re: Bug in bttext_abbrev_convert()

From
Michael Paquier
Date:
On Wed, Jul 1, 2015 at 2:38 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Tue, Jun 30, 2015 at 10:35 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> The regression tests have zero coverage for this
>> tuplesort_performsort() "btspool2" case. That's a fairly common case
>> to have no coverage for, and that took me all of 5 minutes to find.
>
> BTW, I looked here because I added a bunch of sortsupport stuff to
> _bt_load() in 9.5. It appears that that new code is entirely without
> coverage.

That's not cool. A patch for the src/test/regress area would be welcome.
-- 
Michael



Re: Bug in bttext_abbrev_convert()

From
Alvaro Herrera
Date:
Peter Geoghegan wrote:

> It would be nice to always have a html report from gcov always
> available on the internet. That would be something useful to automate,
> IMV.

http://pgci.eisentraut.org/jenkins/job/postgresql_master_coverage/Coverage/

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Bug in bttext_abbrev_convert()

From
Kevin Grittner
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Peter Geoghegan wrote:

>> It would be nice to always have a html report from gcov always
>> available on the internet. That would be something useful to
>> automate, IMV.
>
> http://pgci.eisentraut.org/jenkins/job/postgresql_master_coverage/Coverage/

What would it take to get something like that which uses the
check-world target instead of just the check target?  Without the
additional tests (like the isolation tests), some of these numbers
don't reflect the coverage of regularly run tests.  While it is of
some interest what coverage we get on the 30 second `make check`
runs, I would be a lot more interested in what our coverage is when
the full 8.5 minute set of regression tests we have in git is run.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Bug in bttext_abbrev_convert()

From
Peter Geoghegan
Date:
On Mon, Jul 6, 2015 at 12:27 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
> What would it take to get something like that which uses the
> check-world target instead of just the check target?  Without the
> additional tests (like the isolation tests), some of these numbers
> don't reflect the coverage of regularly run tests.  While it is of
> some interest what coverage we get on the 30 second `make check`
> runs, I would be a lot more interested in what our coverage is when
> the full 8.5 minute set of regression tests we have in git is run.

I agree. Obviously in some cases the coverage is bad because
concurrency is naturally required. It's even more glaring that
recovery codepaths are always all-red in coverage reports.

-- 
Peter Geoghegan



Re: Bug in bttext_abbrev_convert()

From
Alvaro Herrera
Date:
Kevin Grittner wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > Peter Geoghegan wrote:
> 
> >> It would be nice to always have a html report from gcov always
> >> available on the internet. That would be something useful to
> >> automate, IMV.
> >
> > http://pgci.eisentraut.org/jenkins/job/postgresql_master_coverage/Coverage/
> 
> What would it take to get something like that which uses the
> check-world target instead of just the check target?

I suggest CC'ing Peter as a first measure.  I already suggested this (or
something similar) to him months ago.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Bug in bttext_abbrev_convert()

From
Peter Geoghegan
Date:
On Mon, Jul 6, 2015 at 12:52 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> I suggest CC'ing Peter as a first measure.  I already suggested this (or
> something similar) to him months ago.

This would be a worthwhile effort. tuplesort.c only has 46% coverage.
There is no coverage for functions that I know are used all the time
in production, like dumptuples(), or ExecHashIncreaseNumBatches().

We should make increasing test coverage an explicit goal. Ideally,
there'd be a lower quality set of tests that fill in certain gaps in
code coverage, but are used less frequently, and may take much longer
to run. Simply having the code executed will allow tools like Valgrind
to catch bugs.

-- 
Peter Geoghegan



Re: Bug in bttext_abbrev_convert()

From
Peter Eisentraut
Date:
On 7/6/15 3:52 PM, Alvaro Herrera wrote:
> Kevin Grittner wrote:
>> Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>>> Peter Geoghegan wrote:
>>
>>>> It would be nice to always have a html report from gcov always
>>>> available on the internet. That would be something useful to
>>>> automate, IMV.
>>>
>>> http://pgci.eisentraut.org/jenkins/job/postgresql_master_coverage/Coverage/
>>
>> What would it take to get something like that which uses the
>> check-world target instead of just the check target?
> 
> I suggest CC'ing Peter as a first measure.  I already suggested this (or
> something similar) to him months ago.

I would like to do that, but the current host has no more capacity and
the hoster is complaining, so I need to look for a new hosting solution
before I can expand what is being built.





Re: Bug in bttext_abbrev_convert()

From
Andres Freund
Date:
On 2015-07-19 12:35:46 -0400, Peter Eisentraut wrote:
> I would like to do that, but the current host has no more capacity and
> the hoster is complaining, so I need to look for a new hosting solution
> before I can expand what is being built.

Perhaps this could be moved into a virtual machine hosted by the
postgres project infrastructure?

What kind of resources do you currently have and what would you need?

Andres



Re: Bug in bttext_abbrev_convert()

From
Peter Geoghegan
Date:
On Sun, Jul 19, 2015 at 9:53 AM, Andres Freund <andres@anarazel.de> wrote:
> Perhaps this could be moved into a virtual machine hosted by the
> postgres project infrastructure?

I would definitely be willing to start an external project to provide
additional coverage tests if I thought someone was inclined to run
them in this kind of environment. There *is* a place for scattergun
tests that just exercise rarely hit codepaths (or even frequently hit
codepaths that are not tested by the regression tests currently, often
due to requiring non-trivial amounts of data). Running the tests with
tools like Valgrind will catch a certain class of bug, like the one
reported and fixed on this thread.

The advantage of an external project is that being a committer isn't
tied to being a Postgres committer, and that it allows us to have
broader requirements to run all the tests. For example, it would be
acceptable to write some tests in Perl, and others in Python,
according to individual preference. Or, individual scripts could have
requirements that are only practical to satisfy with a
language-specific package manager like Python's pip, which would never
fly with the standard regression tests. I can imagine that kind of
thing being useful for generating a test-case involving very large
hash joins or sorts, where a procedural generation of the tests is
easier than having hundreds of megabytes of dummy data stored in git.

Weren't early versions of isolationtester written in Python + Twisted?

Noah suggested that Valgrind be extended to detect when a buffer was
read from despite not being pinned. This seems like a good idea to me,
but unless and until we exercise the code paths where this might be
happening, which tend to involve complex issues with concurrency, then
we won't get the full benefit of automated detection. I recall
noticing such a bug when reviewing the B-Tree page split patch that
went into 9.4. But the relevant code path isn't exercised at all, and
so even if Valgrind could have detected it, it probably wouldn't have
(the P_INCOMPLETE_SPLIT() path within _bt_moveright() has no
coverage).

I think that a lot of us are doing stuff like this privately already
to some extent, but we don't know about it because it's never
submitted because the requirements are too high.

-- 
Peter Geoghegan