Thread: Timing overhead and Linux clock sources
Over in the "add timing of buffer I/O requests" thread I mentioned having a system where EXPLAIN ANALYZE of a modest COUNT(*) takes 10X as long as just executing the statement. Attached is a bit of SQL and a script that runs it multiple times that demonstrate the problem on systems that have it. It assumes you've installed the Dell Store 2 database. I'd expect it to work with a simpler example too (just create something with 20,000 rows), but I haven't actually confirmed that yet.

From what I've been able to measure so far, here's the state of things:

-If you have a system with a working TSC clock source (timing data is pulled right from the CPU), timing overhead is reasonable enough that you might turn it on even for things that happen frequently, such as the buffer I/O timing patch enables.

-Some systems have tsc, but it doesn't work right in multi-core setups. Newer kernel versions know this and disable it; older ones let you pick tsc anyway, and bad things occur.

-Using any of the other timing mechanisms--hpet, acpi_pm, and jiffies--has extremely high timing overhead.

This doesn't conflict with the comment we got from Martijn van Oosterhout in the other thread: "Learned the hard way while tracking clock-skew on a multicore system. The hpet may not be the fastest (that would be the cpu timer), but it's the fastest (IME) that gives guaranteed monotonic time."

There's more background about TSC, HPET, and an interesting caveat about VMs I came across:

http://en.wikipedia.org/wiki/Time_Stamp_Counter
http://en.wikipedia.org/wiki/High_Precision_Event_Timer
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1007020

I'm starting to envision what a new documentation section discussing timer overhead would look like now, for Linux at least. And I'm guessing other people can simulate this problem even on hardware that doesn't have it in the default config, simply by manually switching to one of the slower timing methods at boot.
The test case is just doing this:

\timing
select count(*) from customers;
explain analyze select count(*) from customers;

Lots of times, with some patterns to try to avoid caching effects.

The system with the worst clock timing issues I've found, the one that was so obvious it kicked off this investigation, is my Thinkpad T60 laptop. The overhead is worst on the old 32-bit Ubuntu 9.04 installation I have there. Here's some info about the system, starting with the 4 clock sources it supports:

$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
hpet acpi_pm jiffies tsc
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet
$ uname -a
Linux meddle 2.6.28-19-generic #61-Ubuntu SMP Wed May 26 23:35:15 UTC 2010 i686 GNU/Linux
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=9.04
DISTRIB_CODENAME=jaunty
DISTRIB_DESCRIPTION="Ubuntu 9.04"
$ psql -c "select version()"
PostgreSQL 8.3.11 on i486-pc-linux-gnu, compiled by GCC gcc-4.3.real (Ubuntu 4.3.3-5ubuntu4) 4.3.3

You change the time source by adding a stanza like this to the kernel boot options:

clocksource=acpi_pm

Martijn said you can just cat a new clock source into the setting instead; I haven't evaluated that yet. I wanted to get my baselines done in a situation where the kernel boots using the one I wanted, which turns out to be useful for a number of reasons. Boot time is one of the things that varied a lot when the timer was switched.

The hpet time source is the default on my T60 system, but there are 3 others available. Of those, it appears that only acpi_pm is a viable alternative. I gave up on the system booting after 15 minutes when using jiffies. I could boot with tsc, but everything was erratic; I could barely use the mouse, for example. tsc did have a much faster clock as measured by EXPLAIN overhead though.
Here are the results, two runs of the test program that I've de-interleaved so it's a block of fast (count only) runs then a block of slow (with EXPLAIN ANALYZE) ones:

Time Source: hpet
Time: 14.783 ms
Time: 9.021 ms
Time: 9.170 ms
Time: 8.634 ms
Time: 5.632 ms
Time: 5.520 ms
Time: 15.276 ms
Time: 9.477 ms
Time: 9.485 ms
Time: 70.720 ms
Time: 67.577 ms
Time: 57.959 ms
Time: 74.661 ms
Time: 73.557 ms
Time: 74.963 ms
Time: 87.898 ms
Time: 82.125 ms
Time: 81.443 ms

Time Source: acpi_pm
Time: 8.124 ms
Time: 5.020 ms
Time: 4.887 ms
Time: 8.125 ms
Time: 5.029 ms
Time: 4.881 ms
Time: 82.986 ms
Time: 82.366 ms
Time: 82.609 ms
Time: 83.089 ms
Time: 82.438 ms
Time: 82.539 ms

Time Source: tsc
Time: 8.371 ms
Time: 4.673 ms
Time: 4.901 ms
Time: 8.409 ms
Time: 4.943 ms
Time: 4.722 ms
Time: 16.436 ms
Time: 16.349 ms
Time: 16.139 ms
Time: 19.871 ms
Time: 17.175 ms
Time: 16.540 ms

There's the 10:1 possible slowdown I was talking about when using the default hpet clock, while the faster (but fundamentally unstable) tsc clock does better, only around a 4:1 slowdown there.

First question I considered: is it the operating system or kernel causing this, rather than the hardware? I have Debian Squeeze running a 64-bit 2.6.32 kernel installed on here too. It only has two time sources available, with hpet being the default:

$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
hpet acpi_pm
$ uname -a
Linux meddle 2.6.32-5-amd64 #1 SMP Thu Nov 3 03:41:26 UTC 2011 x86_64 GNU/Linux
$ psql -c "select version()"
                                               version
------------------------------------------------------------------------------------------------------
 PostgreSQL 8.4.9 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.4.real (Debian 4.4.5-8) 4.4.5, 64-bit

It's not a coincidence that these are the only two clock sources that actually worked in the older kernel.
The jiffies source is old stuff that's been on the deprecation path for a while, and the kernel is eliminating TSC with this message during boot:

[    0.168139] checking TSC synchronization [CPU#0 -> CPU#1]:
[    0.172001] Measured 617820 cycles TSC warp between CPUs, turning off TSC clock.
[    0.172001] Marking TSC unstable due to check_tsc_sync_source failed

So the fact that there can be problems with TSC on some hardware is well known enough that the newer kernel tests for them, then rules TSC out as unusable when it fails that test. Here are the same timing results on this kernel:

Time Source: hpet
Time: 5.701 ms
Time: 3.581 ms
Time: 3.560 ms
Time: 5.696 ms
Time: 3.587 ms
Time: 3.342 ms
Time: 43.492 ms
Time: 43.227 ms
Time: 43.319 ms
Time: 43.503 ms
Time: 43.332 ms
Time: 43.244 ms

Time Source: acpi_pm
Time: 6.350 ms
Time: 3.792 ms
Time: 3.490 ms
Time: 5.806 ms
Time: 3.672 ms
Time: 3.538 ms
Time: 80.093 ms
Time: 79.875 ms
Time: 79.789 ms
Time: 80.033 ms
Time: 80.330 ms
Time: 80.062 ms

That's much better in every way on hpet. But the time ratio hasn't really improved; it's still at least 10:1. Unless there's some blacker magic in an even newer kernel, it seems nothing I can do to this laptop will make it run gettimeofday() quickly.

Let's move on to something that's actually a usable server. My main desktop has an Intel i7-860. It has the same 64-bit Debian Squeeze setup as the second set of T60 results I just showed. The only thing different about the two systems is the hardware; the software is as close to identical as I could make it.
(That's for my own general sanity, not something special I did for this test--all my systems have that same Debian package set.) But there are different clock options available:

$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc hpet acpi_pm
$ uname -a
Linux grace 2.6.32-5-amd64 #1 SMP Thu Nov 3 03:41:26 UTC 2011 x86_64 GNU/Linux
$ psql -c "select version()"
PostgreSQL 8.4.9 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.4.real (Debian 4.4.5-8) 4.4.5, 64-bit

tsc is the default clock source here, and it passes the kernel's synchronization test. Here's some related dmesg output:

[    0.000000] Fast TSC calibration using PIT
...
[    0.244640] checking TSC synchronization [CPU#0 -> CPU#1]: passed.
...
[    0.891992] checking TSC synchronization [CPU#0 -> CPU#7]: passed.
[    0.911993] Brought up 8 CPUs

And here's how it performs:

Time Source: tsc
Time: 2.915 ms
Time: 1.531 ms
Time: 1.500 ms
Time: 2.926 ms
Time: 1.518 ms
Time: 1.478 ms
Time: 2.945 ms
Time: 2.915 ms
Time: 2.882 ms
Time: 2.920 ms
Time: 2.862 ms
Time: 2.856 ms

That's more like it. Next up I plan to survey a few AMD systems and a newer laptop, but I'm out of time for this tonight.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
Attachment
On Tue, Dec 6, 2011 at 9:58 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> -If you have a system with a working TSC clock source (timing data is pulled
> right from the CPU), timing overhead is reasonable enough that you might
> turn it on even for things that happen frequently, such as the buffer I/O
> timing patch enables.

Even the TSC stuff looks expensive enough that you wouldn't want to pay the full overhead all the time on a busy system, but of course we probably wouldn't want to do that anyway. EXPLAIN ANALYZE is extremely expensive mostly because it's timing entry and exit into every plan node, and the way our executor works, those are very frequent operations. But you could probably gather more coarse-grained statistics, like separating parse, plan, and execute time for each query, without breaking a sweat. I'm not sure about buffer I/Os - on a big sequential scan, you might do quite a lot of those in a pretty tight loop. That's not an argument against adding the option, though, assuming that the default setting is off. And, certainly, I agree with you that it's worth trying to document some of this stuff so that people don't have to try to figure it out themselves (uggh!).

One random thought: I wonder if there's a way for us to just time every N'th event or something like that, to keep the overhead low. The problem is that you might not get accurate results if, say, every 2N'th event takes much longer than normal - you'll either hit all the long ones, or miss them all. You could "fix" that by using a pseudorandom number generator to decide whether to time each event, but that's got its own overhead...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 12/06/2011 10:20 PM, Robert Haas wrote:
> EXPLAIN ANALYZE is extremely
> expensive mostly because it's timing entry and exit into every plan
> node, and the way our executor works, those are very frequent
> operations.

The plan for the query I was timing looks like this:

Aggregate  (cost=738.00..738.01 rows=1 width=0) (actual time=3.045..3.045 rows=1 loops=1)
  ->  Seq Scan on customers  (cost=0.00..688.00 rows=20000 width=0) (actual time=0.002..1.700 rows=20000 loops=1)

That's then 20000 * 2 timing calls for the Seq Scan dominating the runtime. On the system with the fast TSC, the fastest execution was 1.478 ms, the slowest with timing 2.945 ms. That's 1.467 ms of total timing overhead, worst-case, so approximately 37 ns per timing call. If you're executing something that is only ever hitting data in shared_buffers, you can measure that; any other case, probably not.

Picking apart the one with slow timing calls on my laptop, the fastest run is 5.52 ms, and the fastest with timing is 57.959 ms. That makes for a minimum of 1311 ns per timing call, best-case.

> I'm not sure about buffer I/Os - on a big sequential scan, you might do quite a lot of those in a pretty
> tight loop.

To put this into perspective relative to the number of EXPLAIN timing calls, there are 488 pages in the relation my test is executing against. I think we need to be careful to keep timing calls from happening at every buffer allocation. I wouldn't expect sprinkling one around every buffer miss would be a problem on a system with a fast clock source. And that is what was shown by the testing Ants Aasma did before submitting the "add timing of buffer I/O requests" patch; his results make more sense to me now. He estimated 22 ns per gettimeofday on the system with fast timing calls--presumably using TSC, and possibly faster than I saw because his system had fewer cores than mine to worry about. He got 990 ns on his slower system, and a worst case there of 3% overhead.
Whether people who are on one of these slower timing call systems would be willing to pay 3% overhead is questionable. But I now believe Ants's claim that it's below the noise level on systems with a good TSC-driven timer. I got a 35:1 ratio between fast and slow clock sources, he got 45:1. If we try to estimate the timing overhead that is too small to measure, I'd guess it's ~3% / >30 = <0.1%. I'd just leave that on all the time on a good TSC-driven system. Buffer hits and tuple-level operations you couldn't afford to time; just about anything else would be fine.

> One random thought: I wonder if there's a way for us to just time
> every N'th event or something like that, to keep the overhead low.

I'm predicting we'll get a lot of future demand for instrumentation features like this, where we want to make them available but would like to keep them from happening too often when the system is busy. Tossing a percentage of them might work. Queueing them somewhere for processing by a background process, and dropping the data if that queue fills, is another idea I've been thinking about recently. I'm working on some ideas for making "is the server busy?" something you can usefully ask the background writer too. There are a number of things that become practical for that process to do once it's decoupled from the checkpoint sync job, since its worst-case response time is then expected to tighten up.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
On Wed, Dec 7, 2011 at 9:40 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> He estimated 22 ns per gettimeofday on the system with fast timing
> calls--presumably using TSC, and possibly faster than I saw because his
> system had fewer cores than mine to worry about. He got 990 ns on his slower
> system, and a worst case there of 3% overhead.

Robert's comment about sequential scans performing lots of reads in a tight loop made me think of a worse worst case: a count(*) with wide rows and/or lots of bloat. I created a test table with one tuple per page like this:

CREATE TABLE io_test WITH (fillfactor=10) AS
    SELECT repeat('x', 1000) FROM generate_series(1,30000);

I then timed SELECT COUNT(*) FROM io_test; with track_iotiming on and off. Averages of 1000 executions, differences significant according to a t-test:

 timer | iotiming=off | iotiming=on |   diff
 hpet  |     86.10 ms |   147.80 ms | 71.67%
 tsc   |     85.86 ms |    87.66 ms |  2.10%

The attached test program (test_gettimeofday_monotonic) shows that one test loop iteration takes 34 ns with tsc and 1270 ns with hpet.

I also managed to run the test program on a couple of two-socket Solaris 10 machines. The one with a Xeon X5570 had an iteration time of 220 ns, and the Xeon E5620 had 270 ns iterations. I'm not sure yet whether this is due to Solaris gettimeofday just being slower or some hardware issue.

I also tested a more reasonable case of count(*) on pgbench_accounts with scale factor 50 (61 tuples per page). With tsc, timing was actually 1% faster, but not statistically significant; with hpet the overhead was 5.6%. Scaling the overhead for the Solaris machines, it seems that the worst case for timing all buffer reads would be about 20% slower. Count(*) on medium-length tuples for an out-of-shared-buffers, in-OS-cache workload would have overhead between 1 and 2%.

>> One random thought: I wonder if there's a way for us to just time
>> every N'th event or something like that, to keep the overhead low.
This would work only for cases where there's a reasonably uniform distribution of times or really long sampling periods; otherwise the variability will make the result pretty much useless. For example, in the I/O case, a pretty typical load can have 1% of timings be 3 orders of magnitude longer than the median.

-- 
Ants Aasma
Attachment
On 12/09/2011 06:48 PM, Ants Aasma wrote:
> The attached test program (test_gettimeofday_monotonic) shows that one
> test loop iteration takes 34ns with tsc and 1270ns with hpet.

This test program is great; I've wanted this exact sort of visibility into this problem for years. I've toyed with writing something like this and then seeing what results it returns on all of the build farm animals. For now I can just run it on all the hardware I have access to, which is not a small list.

I think we should bundle this up similarly to test_fsync, document some best practices on time sources, and then point the vague warning about EXPLAIN overhead toward that. Then new sources of timing overhead can point there too. Much like low-level fsync timing, there's nothing PostgreSQL can do about it; the best we can do is provide a program to help users classify a system as likely or unlikely to have high-overhead timing. I can make the needed docs changes; this is resolving a long-standing issue impacting code I wanted to add.

The rough guideline I see shaping up is that <50 ns is unlikely to cause clock timing to be a significant problem, >500 ns certainly is, and values in the middle should concern people but not necessarily invalidate the timing data collected.

I just confirmed that switching the clock source by echoing a new value into /sys/devices/system/clocksource/clocksource0/current_clocksource works (on the 2.6.32 kernel at least).
What we want to see on a good server is a result that looks like this, from my Intel i7-860 system:

$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
$ ./test_gettimeofday_monotonic 5
Per loop: 39.30 ns
usec:     count    percent
   4:         6   0.00000%
   2:       104   0.00008%
   1:   4999760   3.92983%
   0: 122226109  96.07009%

Here's how badly that degrades if I use one of the alternate time sources:

# echo acpi_pm > /sys/devices/system/clocksource/clocksource0/current_clocksource
$ ./test_gettimeofday_monotonic 5
Per loop: 727.65 ns
usec:     count    percent
  16:         1   0.00001%
   8:         0   0.00000%
   4:      1233   0.01794%
   2:       699   0.01017%
   1:   4992622  72.65764%
   0:   1876879  27.31423%

# echo hpet > /sys/devices/system/clocksource/clocksource0/current_clocksource
$ ./test_gettimeofday_monotonic 5
Per loop: 576.96 ns
usec:     count    percent
   8:         2   0.00002%
   4:      1273   0.01469%
   2:       767   0.00885%
   1:   4993028  57.61598%
   0:   3670977  42.36046%

Switching to the Intel T7200 CPU in my T60 laptop, which only provides the poor-quality time sources, not TSC, the results show timing is really slow there:

$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet
$ ./test_gettimeofday_monotonic 5
Per loop: 1019.60 ns
usec:     count    percent
 256:         2   0.00004%
 128:         3   0.00006%
  64:        90   0.00184%
  32:        23   0.00047%
  16:        92   0.00188%
   8:      1246   0.02541%
   4:        34   0.00069%
   2:    136154   2.77645%
   1:   4700772  95.85818%
   0:     65466   1.33498%

# echo acpi_pm > /sys/devices/system/clocksource/clocksource0/current_clocksource
$ ./test_gettimeofday_monotonic 5
Per loop: 1864.66 ns
usec:     count    percent
 256:         2   0.00007%
 128:         0   0.00000%
  64:         3   0.00011%
  32:         6   0.00022%
  16:        90   0.00336%
   8:      1741   0.06493%
   4:      2062   0.07690%
   2:   2260601  84.30489%
   1:    416954  15.54952%
   0:         0   0.00000%

This seems to be measuring exactly the problem I only had a hand-wave "it's bad on this hardware" explanation of before.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
FYI, I am planning to go ahead and package this tool in /contrib for PG 9.3. --------------------------------------------------------------------------- On Fri, Dec 9, 2011 at 08:26:12PM -0500, Greg Smith wrote: > On 12/09/2011 06:48 PM, Ants Aasma wrote: > >The attached test program (test_gettimeofday_monotonic) shows that one > >test loop iteration takes 34ns with tsc and 1270ns with hpet. > > This test program is great, I've wanted this exact sort of > visibility into this problem for years. I've toyed with writing > something like this and then seeing what results it returns on all > of the build farm animals. For now I can just run it on all the > hardware I have access to, which is not a small list. > > I think we should bundle this up similarly to test_fsync, document > some best practices on time sources, and then point the vague > warning about EXPLAIN overhead toward that. Then new sources of > timing overhead can point there too. Much like low-level fsync > timing, there's nothing PostgreSQL can do about it, the best we can > do is provide a program to help users classify a system as likely or > unlikely to run to have high-overhead timing. I can make the needed > docs changes, this is resolving a long standing issue impacting code > I wanted to add. > > Rough guideline I'm seeing shape up is that <50ns is unlikely to > cause clock timing to be a significant problem, >500ns certainly is, > and values in the middle should concern people but not necessarily > invalidate the timing data collected. > > I just confirmed that switching the clock source by echoing a new > value into > /sys/devices/system/clocksource/clocksource0/current_clocksource > works (on the 2.6.32 kernel at least). 
What we want to see on a > good server is a result that looks like this, from my Intel i7-860 > system: > > $ cat /sys/devices/system/clocksource/clocksource0/current_clocksource > tsc > $ ./test_gettimeofday_monotonic 5 > Per loop: 39.30 ns > usec: count percent > 4: 6 0.00000% > 2: 104 0.00008% > 1: 4999760 3.92983% > 0: 122226109 96.07009% > > Here's how badly that degrades if I use one of the alternate time sources: > > # echo acpi_pm > > /sys/devices/system/clocksource/clocksource0/current_clocksource > $ ./test_gettimeofday_monotonic 5 > Per loop: 727.65 ns > usec: count percent > 16: 1 0.00001% > 8: 0 0.00000% > 4: 1233 0.01794% > 2: 699 0.01017% > 1: 4992622 72.65764% > 0: 1876879 27.31423% > > echo hpet > /sys/devices/system/clocksource/clocksource0/current_clocksource > $ ./test_gettimeofday_monotonic 5 > Per loop: 576.96 ns > usec: count percent > 8: 2 0.00002% > 4: 1273 0.01469% > 2: 767 0.00885% > 1: 4993028 57.61598% > 0: 3670977 42.36046% > > Switching to the Intel T7200 CPU in my T60 laptop only provides the > poor quality time sources, not TSC, and results show timing is > really slow there: > > $ cat /sys/devices/system/clocksource/clocksource0/current_clocksource > hpet > $ ./test_gettimeofday_monotonic 5 > Per loop: 1019.60 ns > usec: count percent > 256: 2 0.00004% > 128: 3 0.00006% > 64: 90 0.00184% > 32: 23 0.00047% > 16: 92 0.00188% > 8: 1246 0.02541% > 4: 34 0.00069% > 2: 136154 2.77645% > 1: 4700772 95.85818% > 0: 65466 1.33498% > > # echo acpi_pm > > /sys/devices/system/clocksource/clocksource0/current_clocksource > $ ./test_gettimeofday_monotonic 5 > Per loop: 1864.66 ns > usec: count percent > 256: 2 0.00007% > 128: 0 0.00000% > 64: 3 0.00011% > 32: 6 0.00022% > 16: 90 0.00336% > 8: 1741 0.06493% > 4: 2062 0.07690% > 2: 2260601 84.30489% > 1: 416954 15.54952% > 0: 0 0.00000% > > This seems to be measuring exactly the problem I only had a > hand-wave "it's bad on this hardware" explanation of before. 
> > -- > Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD > PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Thu, Aug 16, 2012 at 10:28 PM, Bruce Momjian <bruce@momjian.us> wrote: > FYI, I am planning to go ahead and package this tool in /contrib for PG > 9.3. Isn't this exactly what we already did, in 9.2, in the form of contrib/pg_test_timing? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Aug 20, 2012 at 03:11:51PM -0400, Robert Haas wrote:
> On Thu, Aug 16, 2012 at 10:28 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > FYI, I am planning to go ahead and package this tool in /contrib for PG
> > 9.3.
>
> Isn't this exactly what we already did, in 9.2, in the form of
> contrib/pg_test_timing?

Sorry, not sure how I missed that commit. Anyway, I am attaching a patch for 9.3 that I think improves the output of the tool, plus adds some C comments. The new output has the lowest duration times first:

Testing timing overhead for 3 seconds.
Per loop time including overhead: 41.31 nsec
Histogram of timing durations:
< usec   % of total      count
     1      95.87135   69627856
     2       4.12759    2997719
     4       0.00086         628
     8       0.00018         133
    16       0.00001           5
    32       0.00000           1

This should make the output clearer to eyeball for problems --- a good timing run has a high percentage on the first line, rather than on the last line.

-- 
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com
+ It's impossible for everything to be true. +
Attachment
On Sat, Aug 25, 2012 at 10:48 PM, Bruce Momjian <bruce@momjian.us> wrote:
> On Mon, Aug 20, 2012 at 03:11:51PM -0400, Robert Haas wrote:
>> On Thu, Aug 16, 2012 at 10:28 PM, Bruce Momjian <bruce@momjian.us> wrote:
>> > FYI, I am planning to go ahead and package this tool in /contrib for PG
>> > 9.3.
>>
>> Isn't this exactly what we already did, in 9.2, in the form of
>> contrib/pg_test_timing?
>
> Sorry, not sure how I missed that commit. Anyway, I am attaching a
> patch for 9.3 that I think improves the output of the tool, plus adds
> some C comments.
>
> The new output has the lowest duration times first:
>
> Testing timing overhead for 3 seconds.
> Per loop time including overhead: 41.31 nsec
> Histogram of timing durations:
> < usec   % of total      count
>      1      95.87135   69627856
>      2       4.12759    2997719
>      4       0.00086         628
>      8       0.00018         133
>     16       0.00001           5
>     32       0.00000           1
>
> This should make the output clearer to eyeball for problems --- a good
> timing has a high percentage on the first line, rather than on the last
> line.

I guess I'm not sure the output format is an improvement. I wouldn't care much one way or the other if we had made this change at the time, in 9.2, but I'm not sure it's really worth breaking compatibility for a format that may or may not be any better. The person who wrote the original code presumably preferred it the way it already is.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Aug 27, 2012 at 12:39:02PM -0400, Robert Haas wrote:
> On Sat, Aug 25, 2012 at 10:48 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > Sorry, not sure how I missed that commit. Anyway, I am attaching a
> > patch for 9.3 that I think improves the output of the tool, plus adds
> > some C comments.
> >
> > The new output has the lowest duration times first:
> >
> > Testing timing overhead for 3 seconds.
> > Per loop time including overhead: 41.31 nsec
> > Histogram of timing durations:
> > < usec   % of total      count
> >      1      95.87135   69627856
> >      2       4.12759    2997719
> >      4       0.00086         628
> >      8       0.00018         133
> >     16       0.00001           5
> >     32       0.00000           1
> >
> > This should make the output clearer to eyeball for problems --- a good
> > timing has a high percentage on the first line, rather than on the last
> > line.
>
> I guess I'm not sure the output format is an improvement. I wouldn't
> care much one way or the other if we had made this change at the time,
> in 9.2, but I'm not sure it's really worth breaking compatibility for
> a format that may or may not be any better. The person who wrote the
> original code presumably preferred it the way it already is.

He wrote it that way to allow for simpler C code --- he could just start from 31 and keep skipping entries until he hit a non-zero one.

My format makes it easy to see which line should have the majority of the entries, e.g. the first line should be > 90%. I doubt there are enough people running this cross-version that consistency in output makes any difference between major PG versions.
-- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Mon, Aug 27, 2012 at 01:18:51PM -0400, Bruce Momjian wrote:
> > > This should make the output clearer to eyeball for problems --- a good
> > > timing has a high percentage on the first line, rather than on the last
> > > line.
> >
> > I guess I'm not sure the output format is an improvement. I wouldn't
> > care much one way or the other if we had made this change at the time,
> > in 9.2, but I'm not sure it's really worth breaking compatibility for
> > a format that may or may not be any better. The person who wrote the
> > original code presumably preferred it the way it already is.
>
> He wrote it that way to allow for simpler C code --- he could just start
> from 31 and keep skipping entries until he hit a non-zero one.
>
> My format makes it easy to see which line should have the majority of
> the entries, e.g. the first line should be > 90%. I doubt there are enough
> people running this cross-version that consistency in output makes any
> difference between major PG versions.

The really weird part is that this tool outputs a variable number of rows/buckets, depending on the highest non-zero bucket, so I had trouble understanding it when the last line was the one to look at, especially across multiple runs. Also, we heavily adjusted the output of pg_test_fsync over several major releases, and that wasn't a problem for anyone.

-- 
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com
+ It's impossible for everything to be true. +
On Mon, Aug 27, 2012 at 1:18 PM, Bruce Momjian <bruce@momjian.us> wrote:
> He wrote it that way to allow for simpler C code --- he could just start
> from 31 and keep skipping entries until he hit a non-zero one.
>
> My format makes it easy to see which line should have the majority of
> the entries, e.g. the first line should be > 90%. I doubt there are enough
> people running this cross-version that consistency in output makes any
> difference between major PG versions.

I don't see why it's better for the first line to have a big number than the last line. What difference does it make?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Aug 27, 2012 at 04:42:34PM -0400, Robert Haas wrote:
> On Mon, Aug 27, 2012 at 1:18 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > He wrote it that way to allow for simpler C code --- he could just start
> > from 31 and keep skipping entries until he hit a non-zero one.
> >
> > My format makes it easy to see which line should have the majority of
> > the entries, e.g. the first line should be > 90%. I doubt there are enough
> > people running this cross-version that consistency in output makes any
> > difference between major PG versions.
>
> I don't see why it's better for the first line to have a big number
> than the last line. What difference does it make?

When you are looking at that output, the <1 usec line is where you want to see most of the percentage, and it trails off after that. Here is an example in the current output format:

Histogram of timing durations:
< usec:    count    percent
    16:        3   0.00007%
     8:      563   0.01357%
     4:     3241   0.07810%
     2:  2990371  72.05956%
     1:  1155682  27.84870%

That first line is pretty meaningless. You have to look at the last line, see that only 27% of the calls took <1 usec, then look up to see that 72% fell in the 1-2 usec bucket, which isn't good. My format shows:

< usec   % of total      count
     1     27.84870    1155682
     2     72.05956    2990371
     4      0.07810       3241
     8      0.01357        563
    16      0.00007          3

First line, 27%: that's a problem, look down for more details.

-- 
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com
+ It's impossible for everything to be true. +
On 08/27/2012 06:20 PM, Bruce Momjian wrote:
> On Mon, Aug 27, 2012 at 04:42:34PM -0400, Robert Haas wrote:
>> I don't see why it's better for the first line to have a big number
>> than the last line.  What difference does it make?
>
> When you are looking at that output, the < 1 usec line is where you
> want to see most of the percentage, and it trails off after that.

After staring at all the examples I generated again, I think Bruce is
right that the newer format he's suggesting is better.  I know I never
thought about whether reordering for easier interpretation made sense
before, and I'd also guess "it was less coding" was the only reason
Ants used the existing order.

Where I think this is the most useful improvement is when comparing two
systems whose results don't end at the same boundary.  The proposed
formatting would show the good vs. bad examples I put in the docs like
this:

    < usec:      count   percent
         1:   27694571 90.23734%
         2:    2993204  9.75277%
         4:       3010  0.00981%
         8:         22  0.00007%
        16:          1  0.00000%
        32:          1  0.00000%

    < usec:      count   percent
         1:    1155682 27.84870%
         2:    2990371 72.05956%
         4:       3241  0.07810%
         8:        563  0.01357%
        16:          3  0.00007%

And I think it's easier to compare those two in that order.  The docs
specifically suggest looking at the < 1 usec numbers first, and having
just mocked up both orders I do prefer this one for that job.  The way
this was originally written, it's easier to come to an initially
misleading conclusion: the fact that the first system sometimes spikes
into the 32 usec range is the first thing that jumps out at you in the
originally committed ordering, and that's not where people should
focus first.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
On Mon, Aug 27, 2012 at 11:13:00PM -0400, Greg Smith wrote:
> On 08/27/2012 06:20 PM, Bruce Momjian wrote:
> > On Mon, Aug 27, 2012 at 04:42:34PM -0400, Robert Haas wrote:
> > > I don't see why it's better for the first line to have a big number
> > > than the last line.  What difference does it make?
> >
> > When you are looking at that output, the < 1 usec line is where you
> > want to see most of the percentage, and it trails off after that.
>
> After staring at all the examples I generated again, I think Bruce
> is right that the newer format he's suggesting is better.
> [...]
> The way this was originally written, it's easier to come to an
> initially misleading conclusion: the fact that the first system
> sometimes spikes into the 32 usec range is the first thing that jumps
> out at you in the originally committed ordering, and that's not where
> people should focus first.

Yes, I was totally confused about how you would explain analyzing this
output.  Once the patch is applied, you might find it easier to state
that clearly in the docs.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +
On Tue, Aug 28, 2012 at 6:13 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> After staring at all the examples I generated again, I think Bruce is
> right that the newer format he's suggesting is better.  I know I never
> thought about whether reordering for easier interpretation made sense
> before, and I'd also guess "it was less coding" was the only reason
> Ants used the existing order.

That is correct; I didn't really have a preference for the order, so I
went with "simpler is better".  I agree that the order suggested by
Bruce looks better.

Having seen more measurements, I'd have to say that the histogram is
mostly useless; it seems to me that it mostly shows OS scheduling
noise.  I would even say that the histogram output should be hidden
behind a command-line option to avoid unnecessary confusion.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
On Mon, Aug 27, 2012 at 11:13 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> After staring at all the examples I generated again, I think Bruce is
> right that the newer format he's suggesting is better.

OK by me, then.  If you're happy, I'm happy.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Aug 28, 2012 at 10:43:07AM -0400, Robert Haas wrote:
> On Mon, Aug 27, 2012 at 11:13 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> > After staring at all the examples I generated again, I think Bruce is
> > right that the newer format he's suggesting is better.
>
> OK by me, then.  If you're happy, I'm happy.

OK, applied.  The general rule I followed here is that if a report has
a fixed start value and a variable stop value, the start value should
be reported first.  When comparing runs, any missing (trailing) rows
then simply show up as blank space at the end of the output.  This
isn't a common case, so the issue doesn't come up often.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +