Thread: Can we automatically add elapsed times to tap test log?

Can we automatically add elapsed times to tap test log?

From
Andres Freund
Date:
Hi,

right now I am looking at a test added in the shmstats patch that's slow on
CI, on windows only. Unfortunately the regress_log_* output is useless as-is
to figure out where things hang.

I've hit this several times before. Of course it's not too hard to hack up
something printing elapsed time. But ISTM that it'd be better if we were able
to prefix the logging into regress_log_* with something like
[timestamp + time since start of test]
or
[timestamp + time since start of test + time since last log message]


This isn't just useful to figure out what parts of test are slow, but also
helps correlate server logs with the regress_log_* output. Which right now is
hard and inaccurate, requiring manually correlating statements between server
log and the tap test (often there's no logging for statements in the
regress_log_*).

Greetings,

Andres Freund



Re: Can we automatically add elapsed times to tap test log?

From
Nathan Bossart
Date:
On Fri, Apr 01, 2022 at 10:21:50AM -0700, Andres Freund wrote:
> right now I am looking at a test added in the shmstats patch that's slow on
> CI, on windows only. Unfortunately the regress_log_* output is useless as-is
> to figure out where things hang.
> 
> I've hit this several times before. Of course it's not too hard to hack up
> something printing elapsed time. But ISTM that it'd be better if we were able
> to prefix the logging into regress_log_* with something like
> [timestamp + time since start of test]
> or
> [timestamp + time since start of test + time since last log message]
> 
> 
> This isn't just useful to figure out what parts of test are slow, but also
> helps correlate server logs with the regress_log_* output. Which right now is
> hard and inaccurate, requiring manually correlating statements between server
> log and the tap test (often there's no logging for statements in the
> regress_log_*).

+1

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Can we automatically add elapsed times to tap test log?

From
Andrew Dunstan
Date:
On 4/1/22 13:44, Nathan Bossart wrote:
> On Fri, Apr 01, 2022 at 10:21:50AM -0700, Andres Freund wrote:
>> right now I am looking at a test added in the shmstats patch that's slow on
>> CI, on windows only. Unfortunately the regress_log_* output is useless as-is
>> to figure out where things hang.
>>
>> I've hit this several times before. Of course it's not too hard to hack up
>> something printing elapsed time. But ISTM that it'd be better if we were able
>> to prefix the logging into regress_log_* with something like
>> [timestamp + time since start of test]
>> or
>> [timestamp + time since start of test + time since last log message]
>>
>>
>> This isn't just useful to figure out what parts of test are slow, but also
>> helps correlate server logs with the regress_log_* output. Which right now is
>> hard and inaccurate, requiring manually correlating statements between server
>> log and the tap test (often there's no logging for statements in the
>> regress_log_*).
> +1
>


Maybe one way would be to make a change in
src/test/perl/PostgreSQL/Test/SimpleTee.pm. The simplest thing would
just be to add a timestamp, the other things would involve a bit more
bookkeeping. It should also be checked to make sure it doesn't add too
much overhead, although I would be surprised if it did.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Can we automatically add elapsed times to tap test log?

From
Andres Freund
Date:
Hi,

On 2022-04-01 10:21:50 -0700, Andres Freund wrote:
> right now I am looking at a test added in the shmstats patch that's slow on
> CI, on windows only. Unfortunately the regress_log_* output is useless as-is
> to figure out where things hang.

<two hours of debugging later>

This turns out to be a problem somewhere in the tap testing infrastructure,
rather than the test itself. The slow thing wasn't anything the test did. All
the time is spent in an is().  To verify that, I added a bunch of

ok(1, "this is some long output to test a theory");
print_time();

a few tests before the slow test. And:

ok 7 - this is some long output to test a theory
# test theory 1: 0.000 sec
ok 8 - this is some long output to test a theory
# test theory 2: 0.000 sec
ok 9 - this is some long output to test a theory
# test theory 3: 40.484 sec
ok 10 - this is some long output to test a theory
# test theory 4: 0.001 sec
ok 11 - this is some long output to test a theory
# test theory 5: 0.000 sec

The problem also vanishes when running tests without PROVE_FLAGS=-j$something


What this looks like to me is that when running tests concurrently, the buffer
of the file descriptor used to report tap test output fills up. The blocked
test can only progress once prove gets around to reading from that fd,
presumably when another test finishes.

Gah. I want my time back.


I can't reproduce a similar issue on linux. But of course I'm using a newer
perl, and it's likely a timing dependent issue, so it's not guaranteed to be a
windows problem. But ...


Greetings,

Andres Freund



Re: Can we automatically add elapsed times to tap test log?

From
Andrew Dunstan
Date:
On 4/1/22 15:16, Andrew Dunstan wrote:
> On 4/1/22 13:44, Nathan Bossart wrote:
>> On Fri, Apr 01, 2022 at 10:21:50AM -0700, Andres Freund wrote:
>>> right now I am looking at a test added in the shmstats patch that's slow on
>>> CI, on windows only. Unfortunately the regress_log_* output is useless as-is
>>> to figure out where things hang.
>>>
>>> I've hit this several times before. Of course it's not too hard to hack up
>>> something printing elapsed time. But ISTM that it'd be better if we were able
>>> to prefix the logging into regress_log_* with something like
>>> [timestamp + time since start of test]
>>> or
>>> [timestamp + time since start of test + time since last log message]
>>>
>>>
>>> This isn't just useful to figure out what parts of test are slow, but also
>>> helps correlate server logs with the regress_log_* output. Which right now is
>>> hard and inaccurate, requiring manually correlating statements between server
>>> log and the tap test (often there's no logging for statements in the
>>> regress_log_*).
>> +1
>>
>
> Maybe one way would be to make a change in
> src/test/perl/PostgreSQL/Test/SimpleTee.pm. The simplest thing would
> just be to add a timestamp, the other things would involve a bit more
> bookkeeping. It should also be checked to make sure it doesn't add too
> much overhead, although I would be surprised if it did.
>


Along these lines. Untested, it clearly needs a bit of polish (e.g. a
way to turn it on or off for a filehandle). We could use Time::Hires if
you want higher resolution times.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Attachment

Re: Can we automatically add elapsed times to tap test log?

From
Andrew Dunstan
Date:
On 4/1/22 16:25, Andrew Dunstan wrote:
> On 4/1/22 15:16, Andrew Dunstan wrote:
>> On 4/1/22 13:44, Nathan Bossart wrote:
>>> On Fri, Apr 01, 2022 at 10:21:50AM -0700, Andres Freund wrote:
>>>> right now I am looking at a test added in the shmstats patch that's slow on
>>>> CI, on windows only. Unfortunately the regress_log_* output is useless as-is
>>>> to figure out where things hang.
>>>>
>>>> I've hit this several times before. Of course it's not too hard to hack up
>>>> something printing elapsed time. But ISTM that it'd be better if we were able
>>>> to prefix the logging into regress_log_* with something like
>>>> [timestamp + time since start of test]
>>>> or
>>>> [timestamp + time since start of test + time since last log message]
>>>>
>>>>
>>>> This isn't just useful to figure out what parts of test are slow, but also
>>>> helps correlate server logs with the regress_log_* output. Which right now is
>>>> hard and inaccurate, requiring manually correlating statements between server
>>>> log and the tap test (often there's no logging for statements in the
>>>> regress_log_*).
>>> +1
>>>
>> Maybe one way would be to make a change in
>> src/test/perl/PostgreSQL/Test/SimpleTee.pm. The simplest thing would
>> just be to add a timestamp, the other things would involve a bit more
>> bookkeeping. It should also be checked to make sure it doesn't add too
>> much overhead, although I would be surprised if it did.
>>
>
> Along these lines. Untested, it clearly needs a bit of polish (e.g. a
> way to turn it on or off for a filehandle). We could use Time::Hires if
> you want higher resolution times.
>
>


Here's a version that actually works. It produces traces that look like
this:


andrew@emma:pg_upgrade $ grep '([0-9]*s)'
tmp_check/log/regress_log_002_pg_upgrade
[21:55:06](63s) ok 1 - dump before running pg_upgrade
[21:55:22](79s) ok 2 - run of pg_upgrade for new instance
[21:55:27](84s) ok 3 - old and new dumps match after pg_upgrade
[21:55:27](84s) 1..3


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Attachment

Re: Can we automatically add elapsed times to tap test log?

From
Andrew Dunstan
Date:
On 4/2/22 06:57, Andrew Dunstan wrote:
> Here's a version that actually works. It produces traces that look like
> this:
>
>
> andrew@emma:pg_upgrade $ grep '([0-9]*s)'
> tmp_check/log/regress_log_002_pg_upgrade
> [21:55:06](63s) ok 1 - dump before running pg_upgrade
> [21:55:22](79s) ok 2 - run of pg_upgrade for new instance
> [21:55:27](84s) ok 3 - old and new dumps match after pg_upgrade
> [21:55:27](84s) 1..3
>

I know there's a lot going on, but are people interested in this? It's a
pretty small patch to produce something that seems quite useful.


cheers


andrew



--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Can we automatically add elapsed times to tap test log?

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> On 4/2/22 06:57, Andrew Dunstan wrote:
>> Here's a version that actually works. It produces traces that look like
>> this:
>> andrew@emma:pg_upgrade $ grep '([0-9]*s)'
>> tmp_check/log/regress_log_002_pg_upgrade
>> [21:55:06](63s) ok 1 - dump before running pg_upgrade
>> [21:55:22](79s) ok 2 - run of pg_upgrade for new instance
>> [21:55:27](84s) ok 3 - old and new dumps match after pg_upgrade
>> [21:55:27](84s) 1..3

> I know there's a lot going on, but are people interested in this? It's a
> pretty small patch to produce something that seems quite useful.

I too think that the elapsed time is useful.  I'm less convinced
that the time-of-day marker is useful.

It also seems kind of odd that the elapsed time accumulates rather
than being reset for each line.  As it stands one would be doing a lot
of mental subtractions rather than being able to see directly how long
each step takes.  I suppose that on fast machines where each step is
under one second, accumulation would be more useful than printing a
lot of zeroes --- but on the other hand, those aren't the cases where
you're going to be terribly concerned about the runtime.

            regards, tom lane



Re: Can we automatically add elapsed times to tap test log?

From
Andres Freund
Date:
Hi,

On 2022-04-07 17:21:09 -0400, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
> > On 4/2/22 06:57, Andrew Dunstan wrote:
> >> Here's a version that actually works. It produces traces that look like
> >> this:
> >> andrew@emma:pg_upgrade $ grep '([0-9]*s)'
> >> tmp_check/log/regress_log_002_pg_upgrade
> >> [21:55:06](63s) ok 1 - dump before running pg_upgrade
> >> [21:55:22](79s) ok 2 - run of pg_upgrade for new instance
> >> [21:55:27](84s) ok 3 - old and new dumps match after pg_upgrade
> >> [21:55:27](84s) 1..3
> 
> > I know there's a lot going on, but are people interested in this? It's a
> > pretty small patch to produce something that seems quite useful.

It's been 0 days since I last wanted this.


> I too think that the elapsed time is useful.  I'm less convinced
> that the time-of-day marker is useful.

I think it'd be quite useful if it had more precision - it's a pita to
correlate regress_log_* output with server logs.


> It also seems kind of odd that the elapsed time accumulates rather
> than being reset for each line.  As it stands one would be doing a lot
> of mental subtractions rather than being able to see directly how long
> each step takes.  I suppose that on fast machines where each step is
> under one second, accumulation would be more useful than printing a
> lot of zeroes --- but on the other hand, those aren't the cases where
> you're going to be terribly concerned about the runtime.

I like both - if you want to find where the slowdown among a lot of log lines
is, it's easier to look at the time accumulated elapsed time. If you actually
want to see how long individual things take, non-accumulated is more useful.

I've printed both in the past...

Any chance we could print higher res time?

Greetings,

Andres Freund



Re: Can we automatically add elapsed times to tap test log?

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2022-04-07 17:21:09 -0400, Tom Lane wrote:
>> I too think that the elapsed time is useful.  I'm less convinced
>> that the time-of-day marker is useful.

> I think it'd be quite useful if it had more precision - it's a pita to
> correlate regress_log_* output with server logs.

Fair point.  Maybe we could keep the timestamp (with ms precision
if possible) and then the parenthetical bit is time-since-last-line
(also with ms precision)?  I think that would more or less satisfy
both uses.

            regards, tom lane



Re: Can we automatically add elapsed times to tap test log?

From
Andres Freund
Date:
Hi,

On 2022-04-07 17:45:09 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2022-04-07 17:21:09 -0400, Tom Lane wrote:
> >> I too think that the elapsed time is useful.  I'm less convinced
> >> that the time-of-day marker is useful.
> 
> > I think it'd be quite useful if it had more precision - it's a pita to
> > correlate regress_log_* output with server logs.
> 
> Fair point.  Maybe we could keep the timestamp (with ms precision
> if possible) and then the parenthetical bit is time-since-last-line
> (also with ms precision)?  I think that would more or less satisfy
> both uses.

Would work for me...

Greetings,

Andres Freund



Re: Can we automatically add elapsed times to tap test log?

From
Andrew Dunstan
Date:
On 4/7/22 17:58, Andres Freund wrote:
> Hi,
>
> On 2022-04-07 17:45:09 -0400, Tom Lane wrote:
>> Andres Freund <andres@anarazel.de> writes:
>>> On 2022-04-07 17:21:09 -0400, Tom Lane wrote:
>>>> I too think that the elapsed time is useful.  I'm less convinced
>>>> that the time-of-day marker is useful.
>>> I think it'd be quite useful if it had more precision - it's a pita to
>>> correlate regress_log_* output with server logs.
>> Fair point.  Maybe we could keep the timestamp (with ms precision
>> if possible) and then the parenthetical bit is time-since-last-line
>> (also with ms precision)?  I think that would more or less satisfy
>> both uses.
> Would work for me...
>

All doable. Time::HiRes gives us a higher resolution timer. I'll post a
new version in a day or two.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Can we automatically add elapsed times to tap test log?

From
Andrew Dunstan
Date:
On 4/7/22 19:55, Andrew Dunstan wrote:
> On 4/7/22 17:58, Andres Freund wrote:
>> Hi,
>>
>> On 2022-04-07 17:45:09 -0400, Tom Lane wrote:
>>> Andres Freund <andres@anarazel.de> writes:
>>>> On 2022-04-07 17:21:09 -0400, Tom Lane wrote:
>>>>> I too think that the elapsed time is useful.  I'm less convinced
>>>>> that the time-of-day marker is useful.
>>>> I think it'd be quite useful if it had more precision - it's a pita to
>>>> correlate regress_log_* output with server logs.
>>> Fair point.  Maybe we could keep the timestamp (with ms precision
>>> if possible) and then the parenthetical bit is time-since-last-line
>>> (also with ms precision)?  I think that would more or less satisfy
>>> both uses.
>> Would work for me...
>>
> All doable. Time::HiRes gives us a higher resolution timer. I'll post a
> new version in a day or two.


New version attached.


Sample traces:


andrew@emma:log $ egrep '^\[[0-9][0-9]:[00-9][0-9]:' regress_log_020_pg_receivewal | tail -n 15
[09:22:45.031](0.000s) ok 30 # skip postgres was not built with LZ4 support
[09:22:45.032](0.000s) ok 31 # skip postgres was not built with LZ4 support
[09:22:45.296](0.265s) ok 32 - streaming some WAL
[09:22:45.297](0.001s) ok 33 - check that previously partial WAL is now complete
[09:22:45.298](0.001s) ok 34 - check stream dir permissions
[09:22:45.298](0.000s) # Testing pg_receivewal with slot as starting streaming point
[09:22:45.582](0.284s) ok 35 - pg_receivewal fails with non-existing slot: exit code not 0
[09:22:45.583](0.001s) ok 36 - pg_receivewal fails with non-existing slot: matches
[09:22:45.618](0.036s) ok 37 - WAL streamed from the slot's restart_lsn
[09:22:45.619](0.001s) ok 38 - WAL from the slot's restart_lsn has been archived
[09:22:46.597](0.978s) ok 39 - Stream some wal after promoting, resuming from the slot's position
[09:22:46.598](0.001s) ok 40 - WAL segment 00000001000000000000000B archived after timeline jump
[09:22:46.598](0.000s) ok 41 - WAL segment 00000002000000000000000C archived after timeline jump
[09:22:46.598](0.000s) ok 42 - timeline history file archived after timeline jump
[09:22:46.599](0.001s) 1..42


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Attachment

Re: Can we automatically add elapsed times to tap test log?

From
Andrew Dunstan
Date:
On 4/8/22 09:51, Andrew Dunstan wrote:
> On 4/7/22 19:55, Andrew Dunstan wrote:
>> On 4/7/22 17:58, Andres Freund wrote:
>>> Hi,
>>>
>>> On 2022-04-07 17:45:09 -0400, Tom Lane wrote:
>>>> Andres Freund <andres@anarazel.de> writes:
>>>>> On 2022-04-07 17:21:09 -0400, Tom Lane wrote:
>>>>>> I too think that the elapsed time is useful.  I'm less convinced
>>>>>> that the time-of-day marker is useful.
>>>>> I think it'd be quite useful if it had more precision - it's a pita to
>>>>> correlate regress_log_* output with server logs.
>>>> Fair point.  Maybe we could keep the timestamp (with ms precision
>>>> if possible) and then the parenthetical bit is time-since-last-line
>>>> (also with ms precision)?  I think that would more or less satisfy
>>>> both uses.
>>> Would work for me...
>>>
>> All doable. Time::HiRes gives us a higher resolution timer. I'll post a
>> new version in a day or two.
>
> New version attached.
>
>
> Sample traces:
>
>
> andrew@emma:log $ egrep '^\[[0-9][0-9]:[00-9][0-9]:' regress_log_020_pg_receivewal | tail -n 15
> [09:22:45.031](0.000s) ok 30 # skip postgres was not built with LZ4 support
> [09:22:45.032](0.000s) ok 31 # skip postgres was not built with LZ4 support
> [09:22:45.296](0.265s) ok 32 - streaming some WAL
> [09:22:45.297](0.001s) ok 33 - check that previously partial WAL is now complete
> [09:22:45.298](0.001s) ok 34 - check stream dir permissions
> [09:22:45.298](0.000s) # Testing pg_receivewal with slot as starting streaming point
> [09:22:45.582](0.284s) ok 35 - pg_receivewal fails with non-existing slot: exit code not 0
> [09:22:45.583](0.001s) ok 36 - pg_receivewal fails with non-existing slot: matches
> [09:22:45.618](0.036s) ok 37 - WAL streamed from the slot's restart_lsn
> [09:22:45.619](0.001s) ok 38 - WAL from the slot's restart_lsn has been archived
> [09:22:46.597](0.978s) ok 39 - Stream some wal after promoting, resuming from the slot's position
> [09:22:46.598](0.001s) ok 40 - WAL segment 00000001000000000000000B archived after timeline jump
> [09:22:46.598](0.000s) ok 41 - WAL segment 00000002000000000000000C archived after timeline jump
> [09:22:46.598](0.000s) ok 42 - timeline history file archived after timeline jump
> [09:22:46.599](0.001s) 1..42
>
>


In the absence of further comment I have pushed this.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Can we automatically add elapsed times to tap test log?

From
Andres Freund
Date:
On 2022-04-10 09:23:16 -0400, Andrew Dunstan wrote:
> In the absence of further comment I have pushed this.

Thanks!