Thread: gcov coverage data not full with immediate stop
Hello hackers,

I've found that gcov coverage data misses some information when a postgres node is stopped in 'immediate' mode. For example, on the master branch:

make coverage-clean; time make check -C src/test/recovery/; make coverage-html

generates a coverage report with 106193 lines/6318 functions for me (`make check` takes 1m34s). But with the attached simple patch I get a coverage report with 106540 lines/6332 functions (and `make check` takes 2m5s). (IMO, the slowdown of the test is significant.)

So if we want to make the coverage reports more precise, I see three ways:
1. Change the stop mode in teardown_node to fast (probably only when configured with --enable-coverage);
2. Explicitly stop nodes in TAP tests (where it's important) -- seems too tedious and troublesome;
3. Explicitly call __gcov_flush in the SIGQUIT handler (quickdie)? (A rough sketch of this approach follows below.)

Best regards,
Alexander
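[For illustration, option 3 amounts to flushing the gcov counters from the signal handler before the process hard-exits. Below is a minimal standalone sketch, not PostgreSQL code: __gcov_flush() is the libgcov entry point, declared by hand because no public header exposes it, and GCC 11 and later renamed it to __gcov_dump().]

    /* build with: gcc --coverage sketch.c -o sketch */
    #include <signal.h>
    #include <unistd.h>

    extern void __gcov_flush(void);   /* from libgcov; __gcov_dump() on GCC 11+ */

    static void
    sigquit_handler(int signo)
    {
        (void) signo;
        /* write the accumulated counters out to the .gcda files */
        __gcov_flush();
        /* then exit without any cleanup, like an immediate shutdown does */
        _exit(2);
    }

    int
    main(void)
    {
        signal(SIGQUIT, sigquit_handler);
        for (;;)
            pause();                  /* wait for kill -QUIT <pid> */
    }

[Killing this with `kill -QUIT <pid>` leaves the .gcda counters intact; without the flush they are simply lost, which is what an immediate-mode stop does to a postgres backend. Note that __gcov_flush() is not guaranteed to be async-signal-safe, a concern raised later in this thread, so this is only acceptable in test builds.]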
(Strangely, I was just thinking about these branches of mine as I closed my week last Friday...)

On 2020-May-10, Alexander Lakhin wrote:
> So if we want to make the coverage reports more precise, I see the three ways:
> 1. Change the stop mode in teardown_node to fast (probably only when configured with --enable-coverage);
> 2. Explicitly stop nodes in TAP tests (where it's important) -- seems too tedious and troublesome;
> 3. Explicitly call __gcov_flush in SIGQUIT handler (quickdie)?

I tried your idea 3 a long time ago and my experiments didn't show an increase in coverage [1]. But I like this idea the best, and maybe I did something wrong. Attached is the patch I had (on top of fc115d0f9fc6), but I don't know if it still applies.

(The second attachment is another branch I had on this, I don't remember why; that one was on top of 438e51987dcc. The curious thing is that I didn't add the __gcov_flush to quickdie in this one. Maybe what we need is a mix of both.)

I think we should definitely get this fixed for pg13 ...

[1] https://postgr.es/m/20190531170503.GA24057@alvherre.pgsql

--
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2020-May-10, Alexander Lakhin wrote:
>> 3. Explicitly call __gcov_flush in SIGQUIT handler (quickdie)?
> I tried your idea 3 a long time ago and my experiments didn't show an increase in coverage [1]. But I like this idea the best, and maybe I did something wrong. Attached is the patch I had (on top of fc115d0f9fc6), but I don't know if it still applies.

Putting ill-defined, not-controlled-by-us work into a quickdie signal handler sounds like a really bad idea to me. Maybe it's all right, since presumably it would only appear in specialized test builds; but even so, how much could you trust the results?

> I think we should definitely get this fixed for pg13 ...

-1 for shoving in such a thing so late in the cycle. We've survived without it for years, we can do so for a few months more.

regards, tom lane
Hello Alvaro,
11.05.2020 06:42, Alvaro Herrera wrote:
> (Strangely, I was just thinking about these branches of mine as I closed my week last Friday...)
> On 2020-May-10, Alexander Lakhin wrote:
>> So if we want to make the coverage reports more precise, I see the three ways:
>> 1. Change the stop mode in teardown_node to fast (probably only when configured with --enable-coverage);
>> 2. Explicitly stop nodes in TAP tests (where it's important) -- seems too tedious and troublesome;
>> 3. Explicitly call __gcov_flush in SIGQUIT handler (quickdie)?
> I tried your idea 3 a long time ago and my experiments didn't show an increase in coverage [1]. But I like this idea the best, and maybe I did something wrong. Attached is the patch I had (on top of fc115d0f9fc6), but I don't know if it still applies.

Thanks for the reference to that discussion and your patch.
As I see it, the issue with that patch is that quickdie() is not the only SIGQUIT handler. When a backend is interrupted with SIGQUIT, it exits in SignalHandlerForCrashExit().
In fact, if I only add __gcov_flush() in SignalHandlerForCrashExit(), it raises test coverage for `make check -C src/test/recovery/` from 106198 lines/6319 functions to 106420 lines/6328 functions.
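[For reference, the change being measured is essentially a one-liner in the crash-exit handler. Below is a rough sketch against src/backend/postmaster/interrupt.c, not the actual patch: USE_COVERAGE is a hypothetical guard standing in for whatever an --enable-coverage build would define, SIGNAL_ARGS is PostgreSQL's signal-handler-argument macro, and the real handler's surrounding code and comments are omitted.]

    #ifdef USE_COVERAGE
    extern void __gcov_flush(void);   /* libgcov has no public header */
    #endif

    /*
     * SIGQUIT handler for processes that just exit on crash, sketched with
     * a coverage flush added before the no-cleanup exit.
     */
    void
    SignalHandlerForCrashExit(SIGNAL_ARGS)
    {
    #ifdef USE_COVERAGE
        /* hypothetical: write out gcov counters that would otherwise be lost */
        __gcov_flush();
    #endif
        /*
         * Exit without running proc_exit() callbacks, as the real handler
         * does: a crash exit must not try to clean up shared state.
         */
        _exit(2);
    }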
It's not yet clear to me what happens when __gcov_flush() is called inside quickdie().
The test coverage then changes to 108432 lines/5417 functions (the number of function calls decreased).
And, for example, in coverage/src/backend/utils/cache/catcache.c.gcov.html I see:
147 8 : int2eqfast(Datum a, Datum b)
...
153 0 : int2hashfast(Datum datum)
but without __gcov_flush() in quickdie() we have:
147 78038 : int2eqfast(Datum a, Datum b)
...
153 255470 : int2hashfast(Datum datum)
So it needs more investigation.
But I can confirm that calling __gcov_flush() in SignalHandlerForCrashExit() really improves the code coverage report.
I tried to develop a test to increase the coverage for gist:
https://coverage.postgresql.org/src/backend/access/gist/gistxlog.c.gcov.html
(please take a look at the attached test if it's of interest) and ran into this coverage issue along the way. I tried to play with GCOV_PREFIX, but without luck.
Yesterday I found a more recent discussion:
https://www.postgresql.org/message-id/flat/44ecae53-9861-71b7-1d43-4658acc52519%402ndquadrant.com#d02e2e61212831fbceadf290637913a0
(where probably the same problem came up).
Finally, I managed to get the expected coverage when I performed $node_standby->stop() (though __gcov_flush() in SignalHandlerForCrashExit() helps too).
Best regards,
Alexander
On Mon, May 11, 2020 at 2:30 PM Alexander Lakhin <exclusion@gmail.com> wrote:
> But I can confirm that calling __gcov_flush() in SignalHandlerForCrashExit() really improves a code coverage report.
>
> Finally I've managed to get an expected coverage when I performed $node_standby->stop() (but __gcov_flush() in SignalHandlerForCrashExit() helps too).

What happens if a coverage tool other than gcov is used? From that perspective, it's better to perform a clean shutdown in the TAP tests instead of an immediate one, if that's possible.

--
Best Wishes,
Ashutosh Bapat
On Mon, May 11, 2020 at 12:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I think we should definitely get this fixed for pg13 ...
> -1 for shoving in such a thing so late in the cycle. We've survived without it for years, we can do so for a few months more.

I agree, but also, we should start thinking about when to branch. I, too, have patches that aren't critical enough to justify pushing them post-freeze, but which are still good improvements that I'd like to get into the tree. I'm queueing them right now to avoid the risk of destabilizing things, but that generates more work, for me and for other people, if their patches force me to rebase or the other way around.

I know there's always a concern with removing the focus on release N too soon, but the open issues list is 3 items long right now, and 2 of those look like preexisting issues, not new problems in v13. Meanwhile, we have 20+ active committers.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> I agree, but also, we should start thinking about when to branch. I, too, have patches that aren't critical enough to justify pushing them post-freeze, but which are still good improvements that I'd like to get into the tree. I'm queueing them right now to avoid the risk of destabilizing things, but that generates more work, for me and for other people, if their patches force me to rebase or the other way around. I know there's always a concern with removing the focus on release N too soon, but the open issues list is 3 items long right now, and 2 of those look like preexisting issues, not new problems in v13. Meanwhile, we have 20+ active committers.

Yeah. Traditionally we've waited till the start of the next commitfest (which I'm assuming is July 1, for lack of an Ottawa dev meeting to decide differently). But it seems like things are slow enough that perhaps we could branch earlier, like June 1, and give the committers a chance to deal with some of their own stuff before starting the CF.

This is the wrong thread to be debating that in, though. Also I wonder if this is really RMT turf?

regards, tom lane
On Mon, May 11, 2020 at 4:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> This is the wrong thread to be debating that in, though.

True.

> Also I wonder if this is really RMT turf?

I think it is, but the RMT is permitted -- even encouraged -- to consider the views of non-RMT members when making its decision.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, May 11, 2020 at 1:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah. Traditionally we've waited till the start of the next commitfest (which I'm assuming is July 1, for lack of an Ottawa dev meeting to decide differently). But it seems like things are slow enough that perhaps we could branch earlier, like June 1, and give the committers a chance to deal with some of their own stuff before starting the CF.

The RMT discussed this question informally yesterday. The consensus is that we should wait and see what the early feedback from Beta 1 is before making a final decision. An earlier June 1 branch date is an idea that certainly has some merit, but we'd like to put off making a final decision on that for at least another week, and possibly as long as two weeks.

Can that easily be accommodated?

--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
> On Mon, May 11, 2020 at 1:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Yeah. Traditionally we've waited till the start of the next commitfest (which I'm assuming is July 1, for lack of an Ottawa dev meeting to decide differently). But it seems like things are slow enough that perhaps we could branch earlier, like June 1, and give the committers a chance to deal with some of their own stuff before starting the CF.
> The RMT discussed this question informally yesterday. The consensus is that we should wait and see what the early feedback from Beta 1 is before making a final decision. An earlier June 1 branch date is an idea that certainly has some merit, but we'd like to put off making a final decision on that for at least another week, and possibly as long as two weeks.
> Can that easily be accommodated?

There's no real lead time needed AFAICS: when we are ready to branch, we can just do it. So sure, let's wait till the end of May to decide. If things look bad then, we could reconsider again mid-June.

regards, tom lane
On Mon, May 11, 2020 at 05:56:33PM +0530, Ashutosh Bapat wrote:
> What happens if a coverage tool other than gcov is used? From that perspective, it's better to perform a clean shutdown in the TAP tests instead of immediate if that's possible.

Nope. Immediate mode is the fastest path we have to shut down any remaining nodes at the end of a test, per the END{} block at the end of PostgresNode.pm, and I would rather keep it this way: people tend to leave a lot of clusters alive at the end of any new test added, and shutdown checkpoints are not free either, even if fsync is enforced to off in the tests.

I think that a solution built around __gcov_flush() could be the best deal we have, as discussed last year in the thread Álvaro quoted upthread, and I would vote for waiting until v14 opens for business before merging something we consider worth it.

--
Michael
On Tue, May 12, 2020 at 10:10 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Can that easily be accommodated?
> There's no real lead time needed AFAICS: when we are ready to branch, we can just do it. So sure, let's wait till the end of May to decide. If things look bad then, we could reconsider again mid-June.

Great. Let's review it at the end of May, before actually branching.

--
Peter Geoghegan