Thread: gcov coverage data not full with immediate stop

gcov coverage data not full with immediate stop

From
Alexander Lakhin
Date:
Hello hackers,

I've found that gcov coverage data miss some information when a postgres
node stopped in 'immediate' mode.
For example, on the master branch:
make coverage-clean; time make check -C src/test/recovery/; make
coverage-html
generates a coverage report with 106193 lines/6318 functions for me
(`make check` takes 1m34s).
But with the attached simple patch I get a coverage report with 106540
lines/6332 functions (and `make check` takes 2m5s).
(IMO, the slowdown of the test is significant.)

So if we want to make the coverage reports more precise, I see the three
ways:
1. Change the stop mode in teardown_node to fast (probably only when
configured with --enable-coverage);
2. Explicitly stop nodes in TAP tests (where it's important) -- seems
too tedious and troublesome;
3. Explicitly call __gcov_flush in SIGQUIT handler (quickdie)?

Best regards,
Alexander

Attachment

Re: gcov coverage data not full with immediate stop

From
Alvaro Herrera
Date:
(Strangely, I was just thinking about these branches of mine as I
closed my week last Friday...)

On 2020-May-10, Alexander Lakhin wrote:

> So if we want to make the coverage reports more precise, I see the three
> ways:
> 1. Change the stop mode in teardown_node to fast (probably only when
> configured with --enable-coverage);
> 2. Explicitly stop nodes in TAP tests (where it's important) -- seems
> too tedious and troublesome;
> 3. Explicitly call __gcov_flush in SIGQUIT handler (quickdie)?

I tried your idea 3 a long time ago and my experiments didn't show an
increase in coverage [1].  But I like this idea the best, and maybe I
did something wrong.  Attached is the patch I had (on top of
fc115d0f9fc6), but I don't know if it still applies.

(The second attachment is another branch I had on this, I don't remember
why; that one was on top of 438e51987dcc.  The curious thing is that I
didn't add the __gcov_flush to quickdie in this one.  Maybe what we need
is a mix of both.)

I think we should definitely get this fixed for pg13 ...

[1] https://postgr.es/m/20190531170503.GA24057@alvherre.pgsql

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: gcov coverage data not full with immediate stop

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2020-May-10, Alexander Lakhin wrote:
>> 3. Explicitly call __gcov_flush in SIGQUIT handler (quickdie)?

> I tried your idea 3 a long time ago and my experiments didn't show an
> increase in coverage [1].  But I like this idea the best, and maybe I
> did something wrong.  Attached is the patch I had (on top of
> fc115d0f9fc6), but I don't know if it still applies.

Putting ill-defined, not-controlled-by-us work into a quickdie signal
handler sounds like a really bad idea to me.  Maybe it's all right,
since presumably it would only appear in specialized test builds; but
even so, how much could you trust the results?

> I think we should definitely get this fixed for pg13 ...

-1 for shoving in such a thing so late in the cycle.  We've survived
without it for years, we can do so for a few months more.

            regards, tom lane



Re: gcov coverage data not full with immediate stop

From
Alexander Lakhin
Date:
Hello Alvaro,
11.05.2020 06:42, Alvaro Herrera wrote:
(Strangely, I was just thinking about these branches of mine as I
closed my week last Friday...)

On 2020-May-10, Alexander Lakhin wrote:

So if we want to make the coverage reports more precise, I see the three
ways:
1. Change the stop mode in teardown_node to fast (probably only when
configured with --enable-coverage);
2. Explicitly stop nodes in TAP tests (where it's important) -- seems
too tedious and troublesome;
3. Explicitly call __gcov_flush in SIGQUIT handler (quickdie)?
I tried your idea 3 a long time ago and my experiments didn't show an
increase in coverage [1].  But I like this idea the best, and maybe I
did something wrong.  Attached is the patch I had (on top of
fc115d0f9fc6), but I don't know if it still applies.
Thanks for the reference to that discussion and your patch.
As I see the issue with that patch is that quickdie() is not the only SIGQUIT handler. When a backend is interrupted with SIGQUIT, it's exiting in SignalHandlerForCrashExit().
In fact if I only add __gcov_flush() in SignalHandlerForCrashExit(), it raises test coverage for `make check -C src/test/recovery/` from
106198 lines/6319 functions
to
106420 lines/6328 functions

It's not yet clear to me what happens when __gcov_flush() called inside __gcov_flush().
The test coverage changes to:
108432 lines/5417 functions
(number of function calls decreased)
And for example in coverage/src/backend/utils/cache/catcache.c.gcov.html I see
     147           8 : int2eqfast(Datum a, Datum b)
...
     153           0 : int2hashfast(Datum datum)
but without __gcov_flush in quickdie() we have:
     147       78038 : int2eqfast(Datum a, Datum b)
...
     153      255470 : int2hashfast(Datum datum)
So it needs more investigation.

But I can confirm that calling __gcov_flush() in SignalHandlerForCrashExit() really improves a code coverage report.
I tried to develop a test to elevate a coverage for gist:
https://coverage.postgresql.org/src/backend/access/gist/gistxlog.c.gcov.html
(Please look at the attached test if it could be interesting.)
and came to this issue with a coverage. I tried to play with GCOV_PREFIX, but without luck.
Yesterday I found the more recent discussion:
https://www.postgresql.org/message-id/flat/44ecae53-9861-71b7-1d43-4658acc52519%402ndquadrant.com#d02e2e61212831fbceadf290637913a0
(where probably the same problem came out).

Finally I've managed to get an expected coverage when I performed $node_standby->stop() (but __gcov_flush() in SignalHandlerForCrashExit() helps too).

Best regards,
Alexander
Attachment

Re: gcov coverage data not full with immediate stop

From
Ashutosh Bapat
Date:
On Mon, May 11, 2020 at 2:30 PM Alexander Lakhin <exclusion@gmail.com> wrote:

>
> But I can confirm that calling __gcov_flush() in SignalHandlerForCrashExit() really improves a code coverage report.
>
> Finally I've managed to get an expected coverage when I performed $node_standby->stop() (but __gcov_flush() in
SignalHandlerForCrashExit()helps too).
 

What happens if a coverage tool other than gcov is used? From that
perspective, it's better to perform a clean shutdown in the TAP tests
instead of immediate if that's possible.

-- 
Best Wishes,
Ashutosh Bapat



Re: gcov coverage data not full with immediate stop

From
Robert Haas
Date:
On Mon, May 11, 2020 at 12:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I think we should definitely get this fixed for pg13 ...
>
> -1 for shoving in such a thing so late in the cycle.  We've survived
> without it for years, we can do so for a few months more.

I agree, but also, we should start thinking about when to branch. I,
too, have patches that aren't critical enough to justify pushing them
post-freeze, but which are still good improvements that I'd like to
get into the tree. I'm queueing them right now to avoid the risk of
destabilizing things, but that generates more work, for me and for
other people, if their patches force me to rebase or the other way
around. I know there's always a concern with removing the focus on
release N too soon, but the open issues list is 3 items long right
now, and 2 of those look like preexisting issues, not new problems in
v13. Meanwhile, we have 20+ active committers.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: gcov coverage data not full with immediate stop

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I agree, but also, we should start thinking about when to branch. I,
> too, have patches that aren't critical enough to justify pushing them
> post-freeze, but which are still good improvements that I'd like to
> get into the tree. I'm queueing them right now to avoid the risk of
> destabilizing things, but that generates more work, for me and for
> other people, if their patches force me to rebase or the other way
> around. I know there's always a concern with removing the focus on
> release N too soon, but the open issues list is 3 items long right
> now, and 2 of those look like preexisting issues, not new problems in
> v13. Meanwhile, we have 20+ active committers.

Yeah.  Traditionally we've waited till the start of the next commitfest
(which I'm assuming is July 1, for lack of an Ottawa dev meeting to decide
differently).  But it seems like things are slow enough that perhaps
we could branch earlier, like June 1, and give the committers a chance
to deal with some of their own stuff before starting the CF.

This is the wrong thread to be debating that in, though.  Also I wonder
if this is really RMT turf?

            regards, tom lane



Re: gcov coverage data not full with immediate stop

From
Robert Haas
Date:
On Mon, May 11, 2020 at 4:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> This is the wrong thread to be debating that in, though.

True.

> Also I wonder if this is really RMT turf?

I think it is, but the RMT is permitted -- even encouraged -- to
consider the views of non-RMT members when making its decision.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: gcov coverage data not full with immediate stop

From
Peter Geoghegan
Date:
On Mon, May 11, 2020 at 1:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah.  Traditionally we've waited till the start of the next commitfest
> (which I'm assuming is July 1, for lack of an Ottawa dev meeting to decide
> differently).  But it seems like things are slow enough that perhaps
> we could branch earlier, like June 1, and give the committers a chance
> to deal with some of their own stuff before starting the CF.

The RMT discussed this question informally yesterday. The consensus is
that we should wait and see what the early feedback from Beta 1 is
before making a final decision. An earlier June 1 branch date is an
idea that certainly has some merit, but we'd like to put off making a
final decision on that for at least another week, and possibly as long
as two weeks.

Can that easily be accommodated?

-- 
Peter Geoghegan



Re: gcov coverage data not full with immediate stop

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> On Mon, May 11, 2020 at 1:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Yeah.  Traditionally we've waited till the start of the next commitfest
>> (which I'm assuming is July 1, for lack of an Ottawa dev meeting to decide
>> differently).  But it seems like things are slow enough that perhaps
>> we could branch earlier, like June 1, and give the committers a chance
>> to deal with some of their own stuff before starting the CF.

> The RMT discussed this question informally yesterday. The consensus is
> that we should wait and see what the early feedback from Beta 1 is
> before making a final decision. An earlier June 1 branch date is an
> idea that certainly has some merit, but we'd like to put off making a
> final decision on that for at least another week, and possibly as long
> as two weeks.

> Can that easily be accommodated?

There's no real lead time needed AFAICS: when we are ready to branch,
we can just do it.  So sure, let's wait till the end of May to decide.
If things look bad then, we could reconsider again mid-June.

            regards, tom lane



Re: gcov coverage data not full with immediate stop

From
Michael Paquier
Date:
On Mon, May 11, 2020 at 05:56:33PM +0530, Ashutosh Bapat wrote:
> What happens if a coverage tool other than gcov is used? From that
> perspective, it's better to perform a clean shutdown in the TAP tests
> instead of immediate if that's possible.

Nope, as that's the fastest path we have to shut down any remaining
nodes at the end of a test per the END{} block at the end of
PostgresNode.pm, and I would rather keep it this way because people
tend to like keeping around a lot of clusters alive at the end of any
new test added and shutdown checkpoints are not free either even if
fsync is enforced to off in the tests.

I think that a solution turning around __gcov_flush() could be the
best deal we have, as discussed last year in the thread Álvaro quoted
upthread, and I would vote for waiting until v14 opens for business
before merging something we consider worth it.
--
Michael

Attachment

Re: gcov coverage data not full with immediate stop

From
Peter Geoghegan
Date:
On Tue, May 12, 2020 at 10:10 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Can that easily be accommodated?
>
> There's no real lead time needed AFAICS: when we are ready to branch,
> we can just do it.  So sure, let's wait till the end of May to decide.
> If things look bad then, we could reconsider again mid-June.

Great. Let's review it at the end of May, before actually branching.

-- 
Peter Geoghegan