Thread: Can we automatically add elapsed times to tap test log?
Hi, right now I am looking at a test added in the shmstats patch that's slow on CI, on windows only. Unfortunately the regress_log_* output is useless as-is to figure out where things hang. I've hit this several times before. Of course it's not too hard to hack up something printing elapsed time. But ISTM that it'd be better if we were able to prefix the logging into regress_log_* with something like [timestamp + time since start of test] or [timestamp + time since start of test + time since last log message] This isn't just useful to figure out what parts of test are slow, but also helps correlate server logs with the regress_log_* output. Which right now is hard and inaccurate, requiring manually correlating statements between server log and the tap test (often there's no logging for statements in the regress_log_*). Greetings, Andres Freund
On Fri, Apr 01, 2022 at 10:21:50AM -0700, Andres Freund wrote: > right now I am looking at a test added in the shmstats patch that's slow on > CI, on windows only. Unfortunately the regress_log_* output is useless as-is > to figure out where things hang. > > I've hit this several times before. Of course it's not too hard to hack up > something printing elapsed time. But ISTM that it'd be better if we were able > to prefix the logging into regress_log_* with something like > [timestamp + time since start of test] > or > [timestamp + time since start of test + time since last log message] > > > This isn't just useful to figure out what parts of test are slow, but also > helps correlate server logs with the regress_log_* output. Which right now is > hard and inaccurate, requiring manually correlating statements between server > log and the tap test (often there's no logging for statements in the > regress_log_*). +1 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On 4/1/22 13:44, Nathan Bossart wrote: > On Fri, Apr 01, 2022 at 10:21:50AM -0700, Andres Freund wrote: >> right now I am looking at a test added in the shmstats patch that's slow on >> CI, on windows only. Unfortunately the regress_log_* output is useless as-is >> to figure out where things hang. >> >> I've hit this several times before. Of course it's not too hard to hack up >> something printing elapsed time. But ISTM that it'd be better if we were able >> to prefix the logging into regress_log_* with something like >> [timestamp + time since start of test] >> or >> [timestamp + time since start of test + time since last log message] >> >> >> This isn't just useful to figure out what parts of test are slow, but also >> helps correlate server logs with the regress_log_* output. Which right now is >> hard and inaccurate, requiring manually correlating statements between server >> log and the tap test (often there's no logging for statements in the >> regress_log_*). > +1 > Maybe one way would be to make a change in src/test/perl/PostgreSQL/Test/SimpleTee.pm. The simplest thing would just be to add a timestamp, the other things would involve a bit more bookkeeping. It should also be checked to make sure it doesn't add too much overhead, although I would be surprised if it did. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Hi, On 2022-04-01 10:21:50 -0700, Andres Freund wrote: > right now I am looking at a test added in the shmstats patch that's slow on > CI, on windows only. Unfortunately the regress_log_* output is useless as-is > to figure out where things hang. <two hours of debugging later> This turns out to be a problem somewhere in the tap testing infrastructure, rather than the test itself. The slow thing wasn't anything the test did. All the time is spent in an is(). To verify that, I added a bunch of ok(1, "this is some long output to test a theory"); print_time(); a few tests before the slow test. And: ok 7 - this is some long output to test a theory # test theory 1: 0.000 sec ok 8 - this is some long output to test a theory # test theory 2: 0.000 sec ok 9 - this is some long output to test a theory # test theory 3: 40.484 sec ok 10 - this is some long output to test a theory # test theory 4: 0.001 sec ok 11 - this is some long output to test a theory # test theory 5: 0.000 sec The problem also vanishes when running tests without PROVE_FLAGS=-j$something What this looks like to me is that when running tests concurrently, the buffer of the file descriptor used to report tap test output fills up. The blocked test can only progress once prove gets around to reading from that fd, presumably when another test finishes. Gah. I want my time back. I can't reproduce a similar issue on linux. But of course I'm using a newer perl, and it's likely a timing dependent issue, so it's not guaranteed to be a windows problem. But ... Greetings, Andres Freund
On 4/1/22 15:16, Andrew Dunstan wrote: > On 4/1/22 13:44, Nathan Bossart wrote: >> On Fri, Apr 01, 2022 at 10:21:50AM -0700, Andres Freund wrote: >>> right now I am looking at a test added in the shmstats patch that's slow on >>> CI, on windows only. Unfortunately the regress_log_* output is useless as-is >>> to figure out where things hang. >>> >>> I've hit this several times before. Of course it's not too hard to hack up >>> something printing elapsed time. But ISTM that it'd be better if we were able >>> to prefix the logging into regress_log_* with something like >>> [timestamp + time since start of test] >>> or >>> [timestamp + time since start of test + time since last log message] >>> >>> >>> This isn't just useful to figure out what parts of test are slow, but also >>> helps correlate server logs with the regress_log_* output. Which right now is >>> hard and inaccurate, requiring manually correlating statements between server >>> log and the tap test (often there's no logging for statements in the >>> regress_log_*). >> +1 >> > > Maybe one way would be to make a change in > src/test/perl/PostgreSQL/Test/SimpleTee.pm. The simplest thing would > just be to add a timestamp, the other things would involve a bit more > bookkeeping. It should also be checked to make sure it doesn't add too > much overhead, although I would be surprised if it did. > Along these lines. Untested, it clearly needs a bit of polish (e.g. a way to turn it on or off for a filehandle). We could use Time::Hires if you want higher resolution times. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Attachment
On 4/1/22 16:25, Andrew Dunstan wrote: > On 4/1/22 15:16, Andrew Dunstan wrote: >> On 4/1/22 13:44, Nathan Bossart wrote: >>> On Fri, Apr 01, 2022 at 10:21:50AM -0700, Andres Freund wrote: >>>> right now I am looking at a test added in the shmstats patch that's slow on >>>> CI, on windows only. Unfortunately the regress_log_* output is useless as-is >>>> to figure out where things hang. >>>> >>>> I've hit this several times before. Of course it's not too hard to hack up >>>> something printing elapsed time. But ISTM that it'd be better if we were able >>>> to prefix the logging into regress_log_* with something like >>>> [timestamp + time since start of test] >>>> or >>>> [timestamp + time since start of test + time since last log message] >>>> >>>> >>>> This isn't just useful to figure out what parts of test are slow, but also >>>> helps correlate server logs with the regress_log_* output. Which right now is >>>> hard and inaccurate, requiring manually correlating statements between server >>>> log and the tap test (often there's no logging for statements in the >>>> regress_log_*). >>> +1 >>> >> Maybe one way would be to make a change in >> src/test/perl/PostgreSQL/Test/SimpleTee.pm. The simplest thing would >> just be to add a timestamp, the other things would involve a bit more >> bookkeeping. It should also be checked to make sure it doesn't add too >> much overhead, although I would be surprised if it did. >> > > Along these lines. Untested, it clearly needs a bit of polish (e.g. a > way to turn it on or off for a filehandle). We could use Time::Hires if > you want higher resolution times. > > Here's a version that actually works. It produces traces that look like this: andrew@emma:pg_upgrade $ grep '([0-9]*s)' tmp_check/log/regress_log_002_pg_upgrade [21:55:06](63s) ok 1 - dump before running pg_upgrade [21:55:22](79s) ok 2 - run of pg_upgrade for new instance [21:55:27](84s) ok 3 - old and new dumps match after pg_upgrade [21:55:27](84s) 1..3 cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Attachment
On 4/2/22 06:57, Andrew Dunstan wrote: > Here's a version that actually works. It produces traces that look like > this: > > > andrew@emma:pg_upgrade $ grep '([0-9]*s)' > tmp_check/log/regress_log_002_pg_upgrade > [21:55:06](63s) ok 1 - dump before running pg_upgrade > [21:55:22](79s) ok 2 - run of pg_upgrade for new instance > [21:55:27](84s) ok 3 - old and new dumps match after pg_upgrade > [21:55:27](84s) 1..3 > I know there's a lot going on, but are people interested in this? It's a pretty small patch to produce something that seems quite useful. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes: > On 4/2/22 06:57, Andrew Dunstan wrote: >> Here's a version that actually works. It produces traces that look like >> this: >> andrew@emma:pg_upgrade $ grep '([0-9]*s)' >> tmp_check/log/regress_log_002_pg_upgrade >> [21:55:06](63s) ok 1 - dump before running pg_upgrade >> [21:55:22](79s) ok 2 - run of pg_upgrade for new instance >> [21:55:27](84s) ok 3 - old and new dumps match after pg_upgrade >> [21:55:27](84s) 1..3 > I know there's a lot going on, but are people interested in this? It's a > pretty small patch to produce something that seems quite useful. I too think that the elapsed time is useful. I'm less convinced that the time-of-day marker is useful. It also seems kind of odd that the elapsed time accumulates rather than being reset for each line. As it stands one would be doing a lot of mental subtractions rather than being able to see directly how long each step takes. I suppose that on fast machines where each step is under one second, accumulation would be more useful than printing a lot of zeroes --- but on the other hand, those aren't the cases where you're going to be terribly concerned about the runtime. regards, tom lane
Hi, On 2022-04-07 17:21:09 -0400, Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > > On 4/2/22 06:57, Andrew Dunstan wrote: > >> Here's a version that actually works. It produces traces that look like > >> this: > >> andrew@emma:pg_upgrade $ grep '([0-9]*s)' > >> tmp_check/log/regress_log_002_pg_upgrade > >> [21:55:06](63s) ok 1 - dump before running pg_upgrade > >> [21:55:22](79s) ok 2 - run of pg_upgrade for new instance > >> [21:55:27](84s) ok 3 - old and new dumps match after pg_upgrade > >> [21:55:27](84s) 1..3 > > > I know there's a lot going on, but are people interested in this? It's a > > pretty small patch to produce something that seems quite useful. It's been 0 days since I last wanted this. > I too think that the elapsed time is useful. I'm less convinced > that the time-of-day marker is useful. I think it'd be quite useful if it had more precision - it's a pita to correlate regress_log_* output with server logs. > It also seems kind of odd that the elapsed time accumulates rather > than being reset for each line. As it stands one would be doing a lot > of mental subtractions rather than being able to see directly how long > each step takes. I suppose that on fast machines where each step is > under one second, accumulation would be more useful than printing a > lot of zeroes --- but on the other hand, those aren't the cases where > you're going to be terribly concerned about the runtime. I like both - if you want to find where the slowdown among a lot of log lines is, it's easier to look at the time accumulated elapsed time. If you actually want to see how long individual things take, non-accumulated is more useful. I've printed both in the past... Any chance we could print higher res time? Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2022-04-07 17:21:09 -0400, Tom Lane wrote: >> I too think that the elapsed time is useful. I'm less convinced >> that the time-of-day marker is useful. > I think it'd be quite useful if it had more precision - it's a pita to > correlate regress_log_* output with server logs. Fair point. Maybe we could keep the timestamp (with ms precision if possible) and then the parenthetical bit is time-since-last-line (also with ms precision)? I think that would more or less satisfy both uses. regards, tom lane
Hi, On 2022-04-07 17:45:09 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2022-04-07 17:21:09 -0400, Tom Lane wrote: > >> I too think that the elapsed time is useful. I'm less convinced > >> that the time-of-day marker is useful. > > > I think it'd be quite useful if it had more precision - it's a pita to > > correlate regress_log_* output with server logs. > > Fair point. Maybe we could keep the timestamp (with ms precision > if possible) and then the parenthetical bit is time-since-last-line > (also with ms precision)? I think that would more or less satisfy > both uses. Would work for me... Greetings, Andres Freund
On 4/7/22 17:58, Andres Freund wrote: > Hi, > > On 2022-04-07 17:45:09 -0400, Tom Lane wrote: >> Andres Freund <andres@anarazel.de> writes: >>> On 2022-04-07 17:21:09 -0400, Tom Lane wrote: >>>> I too think that the elapsed time is useful. I'm less convinced >>>> that the time-of-day marker is useful. >>> I think it'd be quite useful if it had more precision - it's a pita to >>> correlate regress_log_* output with server logs. >> Fair point. Maybe we could keep the timestamp (with ms precision >> if possible) and then the parenthetical bit is time-since-last-line >> (also with ms precision)? I think that would more or less satisfy >> both uses. > Would work for me... > All doable. Time::HiRes gives us a higher resolution timer. I'll post a new version in a day or two. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On 4/7/22 19:55, Andrew Dunstan wrote: > On 4/7/22 17:58, Andres Freund wrote: >> Hi, >> >> On 2022-04-07 17:45:09 -0400, Tom Lane wrote: >>> Andres Freund <andres@anarazel.de> writes: >>>> On 2022-04-07 17:21:09 -0400, Tom Lane wrote: >>>>> I too think that the elapsed time is useful. I'm less convinced >>>>> that the time-of-day marker is useful. >>>> I think it'd be quite useful if it had more precision - it's a pita to >>>> correlate regress_log_* output with server logs. >>> Fair point. Maybe we could keep the timestamp (with ms precision >>> if possible) and then the parenthetical bit is time-since-last-line >>> (also with ms precision)? I think that would more or less satisfy >>> both uses. >> Would work for me... >> > All doable. Time::HiRes gives us a higher resolution timer. I'll post a > new version in a day or two. New version attached. Sample traces: andrew@emma:log $ egrep '^\[[0-9][0-9]:[00-9][0-9]:' regress_log_020_pg_receivewal | tail -n 15 [09:22:45.031](0.000s) ok 30 # skip postgres was not built with LZ4 support [09:22:45.032](0.000s) ok 31 # skip postgres was not built with LZ4 support [09:22:45.296](0.265s) ok 32 - streaming some WAL [09:22:45.297](0.001s) ok 33 - check that previously partial WAL is now complete [09:22:45.298](0.001s) ok 34 - check stream dir permissions [09:22:45.298](0.000s) # Testing pg_receivewal with slot as starting streaming point [09:22:45.582](0.284s) ok 35 - pg_receivewal fails with non-existing slot: exit code not 0 [09:22:45.583](0.001s) ok 36 - pg_receivewal fails with non-existing slot: matches [09:22:45.618](0.036s) ok 37 - WAL streamed from the slot's restart_lsn [09:22:45.619](0.001s) ok 38 - WAL from the slot's restart_lsn has been archived [09:22:46.597](0.978s) ok 39 - Stream some wal after promoting, resuming from the slot's position [09:22:46.598](0.001s) ok 40 - WAL segment 00000001000000000000000B archived after timeline jump [09:22:46.598](0.000s) ok 41 - WAL segment 00000002000000000000000C archived after timeline jump [09:22:46.598](0.000s) ok 42 - timeline history file archived after timeline jump [09:22:46.599](0.001s) 1..42 cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Attachment
On 4/8/22 09:51, Andrew Dunstan wrote: > On 4/7/22 19:55, Andrew Dunstan wrote: >> On 4/7/22 17:58, Andres Freund wrote: >>> Hi, >>> >>> On 2022-04-07 17:45:09 -0400, Tom Lane wrote: >>>> Andres Freund <andres@anarazel.de> writes: >>>>> On 2022-04-07 17:21:09 -0400, Tom Lane wrote: >>>>>> I too think that the elapsed time is useful. I'm less convinced >>>>>> that the time-of-day marker is useful. >>>>> I think it'd be quite useful if it had more precision - it's a pita to >>>>> correlate regress_log_* output with server logs. >>>> Fair point. Maybe we could keep the timestamp (with ms precision >>>> if possible) and then the parenthetical bit is time-since-last-line >>>> (also with ms precision)? I think that would more or less satisfy >>>> both uses. >>> Would work for me... >>> >> All doable. Time::HiRes gives us a higher resolution timer. I'll post a >> new version in a day or two. > > New version attached. > > > Sample traces: > > > andrew@emma:log $ egrep '^\[[0-9][0-9]:[00-9][0-9]:' regress_log_020_pg_receivewal | tail -n 15 > [09:22:45.031](0.000s) ok 30 # skip postgres was not built with LZ4 support > [09:22:45.032](0.000s) ok 31 # skip postgres was not built with LZ4 support > [09:22:45.296](0.265s) ok 32 - streaming some WAL > [09:22:45.297](0.001s) ok 33 - check that previously partial WAL is now complete > [09:22:45.298](0.001s) ok 34 - check stream dir permissions > [09:22:45.298](0.000s) # Testing pg_receivewal with slot as starting streaming point > [09:22:45.582](0.284s) ok 35 - pg_receivewal fails with non-existing slot: exit code not 0 > [09:22:45.583](0.001s) ok 36 - pg_receivewal fails with non-existing slot: matches > [09:22:45.618](0.036s) ok 37 - WAL streamed from the slot's restart_lsn > [09:22:45.619](0.001s) ok 38 - WAL from the slot's restart_lsn has been archived > [09:22:46.597](0.978s) ok 39 - Stream some wal after promoting, resuming from the slot's position > [09:22:46.598](0.001s) ok 40 - WAL segment 00000001000000000000000B archived after timeline jump > [09:22:46.598](0.000s) ok 41 - WAL segment 00000002000000000000000C archived after timeline jump > [09:22:46.598](0.000s) ok 42 - timeline history file archived after timeline jump > [09:22:46.599](0.001s) 1..42 > > In the absence of further comment I have pushed this. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On 2022-04-10 09:23:16 -0400, Andrew Dunstan wrote: > In the absence of further comment I have pushed this. Thanks!