Thread: gettimeofday is at the end of its usefulness?
A recent question from Tim Kane prompted me to measure the overhead
costs of EXPLAIN ANALYZE, which I'd not checked in awhile. Things
are far worse than I thought. On my current server (by no means
lavish hardware: Xeon E5-2609 @2.40GHz) a simple seqscan can run
at something like 110 nsec per row:

regression=# create table foo as select x as f1 from generate_series(1,1000000) x;
SELECT 1000000
regression=# vacuum foo;
VACUUM
regression=# explain analyze select * from foo;
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Seq Scan on foo  (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.053..111.720 rows=1000000 loops=1)
 Planning time: 0.222 ms
 Execution time: 166.682 ms
(3 rows)

(and, btw, this is a debug build --- without assert and memory
context checks it'd be faster.)

The problem with this number is that a simple test program shows that
gettimeofday() requires about 40 nsec on this hardware. That means
that two-thirds of the above timing measurement is overhead. To add
insult to injury, gettimeofday's output cannot be more precise than
1 microsecond, making its use for measuring sub-microsecond intervals
at best stochastic.

I looked around a bit and found that recent versions of POSIX have a
function clock_gettime() that's a bit more modern than gettimeofday():
at least, the output struct provides nsec rather than usec available
precision. I benchmarked this, with the CLOCK_REALTIME selector, and
found that it also requires about 40nsec, while the output is actually
good to perhaps 10nsec precision. (I base this on seeing no duplicate
readings in a tight loop, so that the value is certainly getting
advanced more often than once every 40 nsec.) There's also a
CLOCK_REALTIME_COARSE selector, which is noticeably faster --- about
10nsec for me --- but the output appears to only advance once every
millisecond, so it's probably useless for our purposes.
The other selectors mentioned in the Linux man page are considerably
slower than CLOCK_REALTIME for me, suggesting that they actually call
into the kernel.

I also tried a loop around a bare "rdtsc" assembly instruction,
finding that that instruction takes about 10nsec. That would be a
nice improvement over gettimeofday, except that using that directly
would involve dealing with cross-CPU skew, which seems like no fun at
all. And I don't really want to get into finding equivalents for
non-Intel architectures, either.

Anyway it looks like clock_gettime() might be worth using on Linux
just for the more precise output. It doesn't seem to exist on OS X
though, and I have no idea about elsewhere. I'm curious if anybody
has ideas about other things we might do for portable high-precision
timing.

regards, tom lane

#include <stdio.h>
#include <time.h>
#include <sys/time.h>

int
main(int argc, char **argv)
{
    long    thistime,
            lasttime = 0;
    int     nbogus = 0;
    int     ndistinct = 0;
    int     i;

    for (i = 0; i < 1000000000; i++)
    {
#if 1
        struct timespec tv;

        clock_gettime(CLOCK_REALTIME, &tv);
        thistime = (long) tv.tv_sec * 1000000000 + tv.tv_nsec;
#else
        struct timeval tv;

        gettimeofday(&tv, NULL);
        thistime = (long) tv.tv_sec * 1000000 + tv.tv_usec;
#endif
        if (thistime < lasttime)
            nbogus++;
        else if (thistime > lasttime)
            ndistinct++;
        lasttime = thistime;
    }
    printf("%d bogus readings\n", nbogus);
    printf("%d distinct readings\n", ndistinct);
    return 0;
}
On Tue, May 13, 2014 at 3:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > There's also a CLOCK_REALTIME_COARSE selector, which is noticeably faster > --- about 10nsec for me --- but the output appears to only advance once > every millisecond, so it's probably useless for our purposes. The other > selectors mentioned in the Linux man page are considerably slower than > CLOCK_REALTIME for me, suggesting that they actually call into the kernel. What Linux kernel version is in use here? Apparently, as I think you've stated another way, more recent versions have VDSO for this, which can make a big difference. This article seems like a sensible guide to all of this: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_MRG/1.3/html/Realtime_Tuning_Guide/sect-Realtime_Tuning_Guide-General_System_Tuning-gettimeofday_speedup.html CLOCK_REALTIME_COARSE seemingly influences precision in a way that allows user space applications to decide on their precision/cost trade-off, rather than being forced to use the system default (that procfs surfaces) through gettimeofday(): http://lwn.net/Articles/342018/ I can see a benefit in exposing this trade-off to Postgres code directly. I still think that a correlated reference period will prove useful, and while there are a number of ways to amortize the cost of repeatedly (coarsely) getting the wall time in the ordinary course of choosing victim buffers, it would be nice to do this too. -- Peter Geoghegan
Peter Geoghegan <pg@heroku.com> writes: > On Tue, May 13, 2014 at 3:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> There's also a CLOCK_REALTIME_COARSE selector, which is noticeably faster >> --- about 10nsec for me --- but the output appears to only advance once >> every millisecond, so it's probably useless for our purposes. The other >> selectors mentioned in the Linux man page are considerably slower than >> CLOCK_REALTIME for me, suggesting that they actually call into the kernel. > What Linux kernel version is in use here? Ah, sorry, I should have specified. This is RHEL6.5, current kernel version 2.6.32-431.17.1.el6.x86_64. > Apparently, as I think > you've stated another way, more recent versions have VDSO for this, > which can make a big difference. This article seems like a sensible > guide to all of this: > https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_MRG/1.3/html/Realtime_Tuning_Guide/sect-Realtime_Tuning_Guide-General_System_Tuning-gettimeofday_speedup.html This appears to be talking about RHEL5, which is quite a bit older (and, I'd guess, trailing edge for anybody who might deploy PG 9.5). I did confirm that /proc/sys/kernel/vsyscall64 exists and has a default setting of 1 on RHEL6. Setting it to 0 causes gettimeofday to take 150ns, which probably represents the time for a trivial kernel call. The MRG extension described on the linked page doesn't seem to be implemented in stock RHEL6 (setting vsyscall64 to 2 is allowed but doesn't change behavior compared to 1). However, if I'm reading it right, all that does is make gettimeofday behave like clock_gettime(CLOCK_REALTIME_COARSE). 
> CLOCK_REALTIME_COARSE seemingly influences precision in a way that > allows user space applications to decide on their precision/cost > trade-off, rather than being forced to use the system default (that > procfs surfaces) through gettimeofday(): > http://lwn.net/Articles/342018/ Yeah, I think these are the same implementations exposed to apps in two different ways, one being a system-wide switch affecting gettimeofday() while the other allows the app source code to say which one it wants. regards, tom lane
On Tue, May 13, 2014 at 11:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I also tried a loop around a bare "rdtsc" assembly instruction, finding > that that instruction takes about 10nsec. That would be a nice > improvement over gettimeofday, except that using that directly would > involve dealing with cross-CPU skew, which seems like no fun at all. > And I don't really want to get into finding equivalents for non-Intel > architectures, either. I always assumed the kernel used rdtsc to implement some of the high performance timers. It can save the current time in a mapped page when it schedules a process and then in the vdso syscall (ie in user-space) it can use rdtsc to calculate the offset needed to adjust that timestamp to the current time. This seems consistent with your calculations that showed the 40ns overhead with +/- 10ns precision. I actually think it would be more interesting if we could measure the overhead and adjust for it. I don't think people are really concerned with how long EXPLAIN ANALYZE takes to run if they could get accurate numbers out of it. Other profiling tools I poked at in the past ran a tight loop around the profiling code to estimate the time it actually took and then subtracted that from all the measurements. I think that might work for the actual clock_gettime overhead. If we did that then we could call it twice and measure the time spent in the rest of the EXPLAIN ANALYZE code and subtract that plus the time for the two clock_gettimes from the run-time... -- greg
Greg Stark <stark@mit.edu> writes: > I actually think it would be more interesting if we could measure the > overhead and adjust for it. Actually, that's quite a good thought. The overhead should be a pretty stable number on any given machine, so in theory we could do this to high precision. And the numbers I just showed say that on current x86_64 platforms, the *best we could possibly hope for* in terms of direct overhead reduction is about 4x. Which is something, but it hardly makes the problem vanish. I have a vague feeling that we discussed subtract-the-overhead once before and thought it wasn't necessary yet. Maybe it's time. But we also need to be using something that gives better than 1usec resolution. So I'm still thinking we should use clock_gettime() where available, and look for alternative APIs where not. regards, tom lane
On Tuesday, May 13, 2014, Tom Lane <tgl@sss.pgh.pa.us> wrote:
A recent question from Tim Kane prompted me to measure the overhead
costs of EXPLAIN ANALYZE, which I'd not checked in awhile. Things
are far worse than I thought. On my current server (by no means
lavish hardware: Xeon E5-2609 @2.40GHz) a simple seqscan can run
at something like 110 nsec per row:
regression=# create table foo as select x as f1 from generate_series(1,1000000) x;
SELECT 1000000
regression=# vacuum foo;
VACUUM
regression=# explain analyze select * from foo;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Seq Scan on foo (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.053..111.720 rows=1000000 loops=1)
Planning time: 0.222 ms
Execution time: 166.682 ms
(3 rows)
(and, btw, this is a debug build --- without assert and memory
context checks it'd be faster.)
The problem with this number is that a simple test program shows that
gettimeofday() requires about 40 nsec on this hardware. That means
that two-thirds of the above timing measurement is overhead.
I'm all for finding something better if we can, but in the mean time this is certainly not unexpected, and isn't it exactly what "explain (analyze,timing off)" was invented for?
Cheers,
Jeff
On Wed, May 14, 2014 at 6:34 AM, Greg Stark <stark@mit.edu> wrote: > I always assumed the kernel used rdtsc to implement some of the high > performance timers. It can save the current time in a mapped page when > it schedules a process and then in the vdso syscall (ie in user-space) > it can use rdtsc to calculate the offset needed to adjust that > timestamp to the current time. This seems consistent with your > calculations that showed the 40ns overhead with +/- 10ns precision. Both gettimeofday and clock_gettime do exactly that. [1] clock_gettime(CLOCK_MONOTONIC) is the mode of operation we would want to use here. > I actually think it would be more interesting if we could measure the > overhead and adjust for it. I don't think people are really concerned > with how long EXPLAIN ANALYZE takes to run if they could get accurate > numbers out of it. Measuring would also be a good idea so we can automatically turn on performance counters like IO timing when we know it's not obscenely expensive. However, subtracting the overhead will still skew the numbers somewhat by giving more breathing time for memory and IO prefetching mechanisms. Another option to consider would be to add a sampling based mechanism for low overhead time attribution. It would be even better if we could distinguish between time spent waiting on locks vs. waiting on IO vs. waiting to be scheduled vs. actually executing. [1] https://github.com/torvalds/linux/blob/master/arch/x86/vdso/vclock_gettime.c#L223 Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
On Tue, May 13, 2014 at 11:34 PM, Greg Stark <stark@mit.edu> wrote: > I always assumed the kernel used rdtsc to implement some of the high > performance timers. It can save the current time in a mapped page when > it schedules a process and then in the vdso syscall (ie in user-space) > it can use rdtsc to calculate the offset needed to adjust that > timestamp to the current time. This seems consistent with your > calculations that showed the 40ns overhead with +/- 10ns precision. Crazy idea: Instead of trying to time precisely the amount of time we spend in each node, configure a very-high frequency timer interrupt (or background thread?) that does: SomeGlobalVariablePointingToTheCurrentNode->profiling_counter++; -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
I posted this on this mailing list before: at Jane Street we have developed very fast code to get timing information based on the TSC if available. It's all ocaml but well documented and mostly just calls to c functions, so it should be easy to port to C, and we release it under a very liberal license, so it should be no problem to take the ideas:
https://github.com/janestreet/core/blob/master/lib/time_stamp_counter.mli
Hope this is useful.
On Wed, May 14, 2014 at 12:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, May 13, 2014 at 11:34 PM, Greg Stark <stark@mit.edu> wrote:
> I always assumed the kernel used rdtsc to implement some of the high
> performance timers. It can save the current time in a mapped page when
> it schedules a process and then in the vdso syscall (ie in user-space)
> it can use rdtsc to calculate the offset needed to adjust that
> timestamp to the current time. This seems consistent with your
> calculations that showed the 40ns overhead with +/- 10ns precision.

Crazy idea: Instead of trying to time precisely the amount of time we
spend in each node, configure a very-high frequency timer interrupt
(or background thread?) that does:

SomeGlobalVariablePointingToTheCurrentNode->profiling_counter++;
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, May 15, 2014 at 8:19 AM, Benedikt Grundmann <bgrundmann@janestreet.com> wrote:
> I posted this on this mailing list before at Jane Street we have developed
> very fast code to get timing information based on TSC if available. It's
> all ocaml but well documented and mostly just calls to c functions so should
> be easy to port to C and we release it under a very liberal license so it
> should be no problem to take the ideas:
> https://github.com/janestreet/core/blob/master/lib/time_stamp_counter.mli
> Hope this is useful.
Also I'm sorry for top posting. hackers is the only mailing list I'm on that requires this, and some others require top posting, so this runs counter to my habits and I only realized after sending...
On Wed, May 14, 2014 at 12:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, May 13, 2014 at 11:34 PM, Greg Stark <stark@mit.edu> wrote:
> I always assumed the kernel used rdtsc to implement some of the high
> performance timers. It can save the current time in a mapped page when
> it schedules a process and then in the vdso syscall (ie in user-space)
> it can use rdtsc to calculate the offset needed to adjust that
> timestamp to the current time. This seems consistent with your
> calculations that showed the 40ns overhead with +/- 10ns precision.

Crazy idea: Instead of trying to time precisely the amount of time we
spend in each node, configure a very-high frequency timer interrupt
(or background thread?) that does:

SomeGlobalVariablePointingToTheCurrentNode->profiling_counter++;
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, May 15, 2014 at 8:19 AM, Benedikt Grundmann
<bgrundmann@janestreet.com> wrote:
> I posted this on this mailing list before at Jane Street we have developed
> very fast code to get timing information based on TSC if available. It's
> all ocaml but well documented and mostly just calls to c functions so should
> be easy to port to C and we release it under a very liberal license so it
> should be no problem to take the ideas:

What OS do you run it on though? How fast is your implementation
compared to the kernel implementation of clock_gettime()?

Are you sure your implementation is actually faster? And are you sure
you're protected against clocks going backwards? I think you should
put some i/o in the loop in the test and start several threads running
it to make it more likely the thread is rescheduled to a different
processor during the test. I suspect you'll find the rdtsc goes
backwards sometimes or produces crazy results when switching
processors.

--
greg
On Thu, May 15, 2014 at 11:31 AM, Greg Stark <stark@mit.edu> wrote:
There are benchmarks in the link I posted (obtained by a micro-benchmarking library we developed and use internally, which takes great care to obtain reliable numbers). We use POSIX threads extensively. We internally spend a lot of time setting up NTP and monitoring systems so that the clock going backwards never happens (in other words, I wouldn't be surprised if the library does NOT work correctly when it does -- our protection is outside the library). I do not believe we have seen the rdtsc going backwards on thread context switch that you mention (and, as said, we use lots of threads). OS? CentOS 6.5 primarily.
> I posted this on this mailing list before at Jane Street we have developed
> very fast code to get timing information based on TSC if available. It's
> all ocaml but well documented and mostly just calls to c functions so should
> be easy to port to C and we release it under a very liberal license so it
> should be no problem to take the ideas:

What OS do you run it on though? How fast is your implementation
compared to the kernel implementation of clock_gettime()?
Are you sure your implementation is actually faster? And are you sure
you're protected against clocks going backwards? I think you should
put some i/o in the loop in the test and start several threads running
it to make it more likely the thread is rescheduled to a different
processor during the test. I suspect you'll find the rdtsc goes
backwards sometimes or produces crazy results when switching
processors.
--
greg
On 2014-05-15 12:04:25 +0100, Benedikt Grundmann wrote:
> On Thu, May 15, 2014 at 11:31 AM, Greg Stark <stark@mit.edu> wrote:
> > On Thu, May 15, 2014 at 8:19 AM, Benedikt Grundmann
> > <bgrundmann@janestreet.com> wrote:
> > > I posted this on this mailing list before at Jane Street we have
> > > developed very fast code to get timing information based on TSC if
> > > available. It's all ocaml but well documented and mostly just calls
> > > to c functions so should be easy to port to C and we release it under
> > > a very liberal license so it should be no problem to take the ideas:
> >
> > What OS do you run it on though? How fast is your implementation
> > compared to the kernel implementation of clock_gettime()?
> >
> > Are you sure your implementation is actually faster? And are you sure
> > you're protected against clocks going backwards? I think you should
> > put some i/o in the loop in the test and start several threads running
> > it to make it more likely the thread is rescheduled to a different
> > processor during the test. I suspect you'll find the rdtsc goes
> > backwards sometimes or produces crazy results when switching
> > processors.
>
> There are benchmarks in the link I posted (obtained by a micro-benchmarking
> library we developed and use internally, which takes great care to obtain
> reliable numbers). We use POSIX threads extensively. We internally spend
> a lot of time setting up NTP and monitoring systems so that the clock going
> backwards never happens (in other words, I wouldn't be surprised if the
> library does NOT work correctly when it does -- our protection is outside
> the library). I do not believe we have seen the rdtsc going backwards on
> thread context switch that you mention (and, as said, we use lots of
> threads). OS? CentOS 6.5 primarily.

Did you test it on a server with more than one socket (i.e. not just
multiple cores, but distinct CPU packages)? That's where you'd expect
differences in the TSC to have funny effects.
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Hi, On 2014-05-13 18:58:11 -0400, Tom Lane wrote: > Anyway it looks like clock_gettime() might be worth using on Linux > just for the more precise output. It doesn't seem to exist on OS X > though, and I have no idea about elsewhere. Agreed that using clock_gettime() would be a good idea. I'd say we should have a wrapper around it that is able to provide nanosecond precision. If only gettimeofday() (and whatever windows is using) is available, we can dynamically fall back to that. > I'm curious if anybody has ideas about other things we might do for > portable high-precision timing. It's far from a solve-it-all, but can we perhaps try to coalesce repeated time measurements? We'll very frequently do a InstrStopNode(); /* minimal amount of work */ InstrStartNode(); which will measure the time twice. I think there's a fair number of scenarios where once would be enough. I'll freely admit that I haven't looked enough to determine how we could do that API wise. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, May 13, 2014 at 06:58:11PM -0400, Tom Lane wrote:
> A recent question from Tim Kane prompted me to measure the overhead
> costs of EXPLAIN ANALYZE, which I'd not checked in awhile. Things
> are far worse than I thought. On my current server (by no means
> lavish hardware: Xeon E5-2609 @2.40GHz) a simple seqscan can run
> at something like 110 nsec per row:

I assume you ran pg_test_timing too:

Testing timing overhead for 3 seconds.
Per loop time including overhead: 41.70 nsec
Histogram of timing durations:
< usec   % of total      count
     1      95.83035   68935459
     2       4.16923    2999133
     4       0.00037        268
     8       0.00004         31
    16       0.00000          1
    32       0.00000          1

My overhead of 41.70 nsec matches yours.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +
On 15 May 2014 at 19:56, Bruce Momjian <bruce@momjian.us> wrote:
On Tue, May 13, 2014 at 06:58:11PM -0400, Tom Lane wrote:
> A recent question from Tim Kane prompted me to measure the overhead
> costs of EXPLAIN ANALYZE, which I'd not checked in awhile. Things
> are far worse than I thought. On my current server (by no means
> lavish hardware: Xeon E5-2609 @2.40GHz) a simple seqscan can run
> at something like 110 nsec per row:
I assume you ran pg_test_timing too:
Testing timing overhead for 3 seconds.
Per loop time including overhead: 41.70 nsec
Histogram of timing durations:
< usec % of total count
1 95.83035 68935459
2 4.16923 2999133
4 0.00037 268
8 0.00004 31
16 0.00000 1
32 0.00000 1
My overhead of 41.70 nsec matches yours.
Did this idea die, or is it still worth considering?
Thom
Thom Brown <thom@linux.com> writes: > On 15 May 2014 at 19:56, Bruce Momjian <bruce@momjian.us> wrote: >> On Tue, May 13, 2014 at 06:58:11PM -0400, Tom Lane wrote: >>> A recent question from Tim Kane prompted me to measure the overhead >>> costs of EXPLAIN ANALYZE, which I'd not checked in awhile. Things >>> are far worse than I thought. On my current server (by no means >>> lavish hardware: Xeon E5-2609 @2.40GHz) a simple seqscan can run >>> at something like 110 nsec per row: > Did this idea die, or is it still worth considering? We still have a problem, for sure. I'm not sure that there was any consensus on what to do about it. Using clock_gettime(CLOCK_REALTIME) if available would be a straightforward change that should ameliorate gettimeofday()'s 1-usec-precision-limit problem; but it doesn't do anything to fix the excessive-overhead problem. The ideas about the latter were all over the map, and none of them looked easy. If you're feeling motivated to work on this area, feel free. regards, tom lane
On 6/8/16 9:56 AM, Tom Lane wrote: > Thom Brown <thom@linux.com> writes: >> On 15 May 2014 at 19:56, Bruce Momjian <bruce@momjian.us> wrote: >>> On Tue, May 13, 2014 at 06:58:11PM -0400, Tom Lane wrote: >>>> A recent question from Tim Kane prompted me to measure the overhead >>>> costs of EXPLAIN ANALYZE, which I'd not checked in awhile. Things >>>> are far worse than I thought. On my current server (by no means >>>> lavish hardware: Xeon E5-2609 @2.40GHz) a simple seqscan can run >>>> at something like 110 nsec per row: > >> Did this idea die, or is it still worth considering? > > We still have a problem, for sure. I'm not sure that there was any > consensus on what to do about it. Using clock_gettime(CLOCK_REALTIME) > if available would be a straightforward change that should ameliorate > gettimeofday()'s 1-usec-precision-limit problem; but it doesn't do > anything to fix the excessive-overhead problem. The ideas about the > latter were all over the map, and none of them looked easy. > > If you're feeling motivated to work on this area, feel free. Semi-related: someone (Robert I think) recently mentioned investigating "vectorized" executor nodes, where multiple tuples would be processed in one shot. If we had that presumably the explain penalty would be a moot point. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) mobile: 512-569-9461
On Tue, Jun 14, 2016 at 4:27 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: > Semi-related: someone (Robert I think) recently mentioned investigating > "vectorized" executor nodes, where multiple tuples would be processed in one > shot. If we had that presumably the explain penalty would be a moot point. Yeah, both Andres and I are interested in that, and I think he's actively working on it. It would be quite neat if this had the effect of reducing EXPLAIN ANALYZE's overhead to something trivial. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2016-07-07 14:43:31 -0400, Robert Haas wrote: > On Tue, Jun 14, 2016 at 4:27 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: > > Semi-related: someone (Robert I think) recently mentioned investigating > > "vectorized" executor nodes, where multiple tuples would be processed in one > > shot. If we had that presumably the explain penalty would be a moot point. > > Yeah, both Andres and I are interested in that, and I think he's > actively working on it. It would be quite neat if this had the effect > of reducing EXPLAIN ANALYZE's overhead to something trivial. I am, and it does reduce the overhead. Depends on the type of plan though. Index nestloops e.g. don't benefit on the inner side. Andres
On Thu, Jun 9, 2016 at 12:56 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thom Brown <thom@linux.com> writes:
>> On 15 May 2014 at 19:56, Bruce Momjian <bruce@momjian.us> wrote:
>>> On Tue, May 13, 2014 at 06:58:11PM -0400, Tom Lane wrote:
>>>> A recent question from Tim Kane prompted me to measure the overhead
>>>> costs of EXPLAIN ANALYZE, which I'd not checked in awhile. Things
>>>> are far worse than I thought. On my current server (by no means
>>>> lavish hardware: Xeon E5-2609 @2.40GHz) a simple seqscan can run
>>>> at something like 110 nsec per row:
>
>> Did this idea die, or is it still worth considering?
>
> We still have a problem, for sure. I'm not sure that there was any
> consensus on what to do about it. Using clock_gettime(CLOCK_REALTIME)
> if available would be a straightforward change that should ameliorate
> gettimeofday()'s 1-usec-precision-limit problem; but it doesn't do
> anything to fix the excessive-overhead problem. The ideas about the
> latter were all over the map, and none of them looked easy.
>
> If you're feeling motivated to work on this area, feel free.

How about using both CLOCK_REALTIME and CLOCK_REALTIME_COARSE as the
clock ids in clock_gettime wherever applicable? The COARSE option
would be used wherever no precise timing calculation is required,
because on my laptop a significant performance difference (about 8x)
is observed compared to CLOCK_REALTIME.

If it is fine, I will try to update the code and send a patch.

Regards,
Hari Babu
Fujitsu Australia
On Thu, Aug 25, 2016 at 3:12 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
On Thu, Jun 9, 2016 at 12:56 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thom Brown <thom@linux.com> writes:
>> On 15 May 2014 at 19:56, Bruce Momjian <bruce@momjian.us> wrote:
>>> On Tue, May 13, 2014 at 06:58:11PM -0400, Tom Lane wrote:
>>>> A recent question from Tim Kane prompted me to measure the overhead
>>>> costs of EXPLAIN ANALYZE, which I'd not checked in awhile. Things
>>>> are far worse than I thought. On my current server (by no means
>>>> lavish hardware: Xeon E5-2609 @2.40GHz) a simple seqscan can run
>>>> at something like 110 nsec per row:
>
>> Did this idea die, or is it still worth considering?
>
> We still have a problem, for sure. I'm not sure that there was any
> consensus on what to do about it. Using clock_gettime(CLOCK_REALTIME)
> if available would be a straightforward change that should ameliorate
> gettimeofday()'s 1-usec-precision-limit problem; but it doesn't do
> anything to fix the excessive-overhead problem. The ideas about the
> latter were all over the map, and none of them looked easy.
>
> If you're feeling motivated to work on this area, feel free.
How about using both CLOCK_REALTIME and CLOCK_REALTIME_COARSE
as the clock ids in clock_gettime wherever applicable? The COARSE option
would be used wherever no precise timing calculation is required, because
on my laptop a significant performance difference (about 8x) is observed
compared to
CLOCK_REALTIME.
If it is fine, I will try to update the code and send a patch.
Attached a patch that replaces most of the gettimeofday function calls, except timeofday(user callable) and GetCurrentTimestamp functions.
Didn't add any configure checks for the case where the clock_gettime function is not available, nor the fallback logic to a gettimeofday function call.
Any comments in proceeding further?
Hari Babu
Fujitsu Australia
Haribabu Kommi <kommi.haribabu@gmail.com> writes:
> Attached a patch that replaces most of the getimeofday function calls,
> except timeofday(user callable) and GetCurrentTimestamp functions.
> Didn't add any configure checks in case if the clock_gettime function is
> not available, the fallback logic to gettimeofday function call.

Well, of course, configure is the hard part. I got interested in this
area again pursuant to a question from Joel Jacobson, and looked around
to see if things had changed any since 2014.

One pleasant surprise is that Apple got around to implementing
clock_gettime() as of the current macOS release (10.12 "Sierra"). That
means that pretty much all interesting platforms now have
clock_gettime(), which removes one possible objection to starting to
use it.

However, it seems like there is not a lot of commonality to the best
"clockid" to use. In theory, according to the POSIX spec,
CLOCK_MONOTONIC would be what we want to use for time interval
measurement (EXPLAIN ANALYZE), since that would be impervious to
possible clock setting changes. But some implementations might only
optimize the more common CLOCK_REALTIME, and I found that there are a
lot of platform-specific clock IDs that we might want to consider.

On Linux (RHEL6, 2.4GHz x86_64), I find that gettimeofday(),
clock_gettime(CLOCK_MONOTONIC), and clock_gettime(CLOCK_REALTIME) all
take about 40ns. Of course gettimeofday() only has 1us resolution,
but the other two have perhaps 10ns resolution (I get no duplicate
readings in a tight loop). Other documented clockids include:

CLOCK_REALTIME_COARSE: about 10ns to read, but only 1ms resolution
CLOCK_MONOTONIC_COARSE: about 12ns to read, but only 1ms resolution
CLOCK_MONOTONIC_RAW: full resolution but very slow, ~145ns to read

So CLOCK_MONOTONIC seems to be the thing to use here. It won't buy us
anything speed-wise but the extra resolution will be nice. However,
we need to do more research to see if this holds true on other popular
distros.
On macOS (10.12.2, 2.7GHz x86_64), clock_gettime(CLOCK_REALTIME) is actually a shade faster than gettimeofday: 40ns versus 46ns. But it only gives 1us resolution, no better than gettimeofday. CLOCK_MONOTONIC is also 1us, and it takes 75ns to read. But there's a CLOCK_MONOTONIC_RAW that takes 44ns to read and seems to offer full precision -- no duplicate readings in a tight loop. There's also CLOCK_MONOTONIC_RAW_APPROX, which can be read in 23ns, but its resolution is only around half a millisecond.

I also tried FreeBSD 11.0 on another Mac (2.3GHz x86_64), and found that gettimeofday as well as basically all their clock_gettime variants run in 27 to 28 ns; and clock_gettime reliably delivers full precision, except for CLOCK_SECOND, which is intentionally truncated to 1s precision. So there would be no need to work with anything but CLOCK_MONOTONIC here. However, it seems that these impressive results date back only to June 2012, cf
https://github.com/freebsd/freebsd/commit/13a9f42818f6b89a72b3e40923be809b490400d8
and at least as of that commit, only x86 and x86_64 had the fast clock_gettime code. Older FreeBSD, or FreeBSD on another architecture, is likely to be a lot worse. But I lack an installation to try.

I also tried OpenBSD 6.0 on that same Mac, and got pretty horrid results: gettimeofday, CLOCK_REALTIME, and CLOCK_MONOTONIC all take about 613ns to read. Ouch. And so does time(NULL); double ouch. Evidently there's no optimization on this platform, and what we're seeing is the minimum cost of a kernel call. Still, we do get better precision from clock_gettime than gettimeofday, so we might as well switch.

So it seems like the configure support we'd need is to detect whether clock_gettime is available (note on Linux there's also a library requirement, -lrt), and we would also need a way to provide a platform-specific choice of clockid; we at least need enough smarts to use CLOCK_MONOTONIC_RAW on macOS.

regards, tom lane
On Dec 26, 2016 10:35 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
So it seems like the configure support we'd need is to detect
whether clock_gettime is available (note on Linux there's also
a library requirement, -lrt), and we would also need a way to
provide a platform-specific choice of clockid; we at least need
enough smarts to use CLOCK_MONOTONIC_RAW on macOS.
This seems like something that really should be checked at runtime. It's very specific to the kernel you're running on, not the build environment, and it can hopefully be measured in only a second or even a fraction of a second. The only problem would be if other things running on the system made the test results unpredictable, so that you had a small chance of getting a very suboptimal choice and were re-rolling the dice each time you restarted...
On 2016-12-27 01:35:05 +0000, Greg Stark wrote:
> On Dec 26, 2016 10:35 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
>> So it seems like the configure support we'd need is to detect
>> whether clock_gettime is available (note on Linux there's also
>> a library requirement, -lrt), and we would also need a way to
>> provide a platform-specific choice of clockid; we at least need
>> enough smarts to use CLOCK_MONOTONIC_RAW on macOS.
>
> This seems like something that really should be checked at runtime. It's
> very specific to the kernel you're running on, not the build
> environment, and it can hopefully be measured in only a second or even a
> fraction of a second. The only problem would be if other things running
> on the system made the test results unpredictable, so that you had a small
> chance of getting a very suboptimal choice and were re-rolling the dice
> each time you restarted...

I'm pretty strongly against doing performance measurements at startup. Both the delay and the potential for differing test results seem like pretty bad consequences.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2016-12-27 01:35:05 +0000, Greg Stark wrote:
>> On Dec 26, 2016 10:35 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
>>> So it seems like the configure support we'd need is to detect
>>> whether clock_gettime is available (note on Linux there's also
>>> a library requirement, -lrt), and we would also need a way to
>>> provide a platform-specific choice of clockid; we at least need
>>> enough smarts to use CLOCK_MONOTONIC_RAW on macOS.

>> This seems like something that really should be checked at runtime.

> I'm pretty strongly against doing performance measurements at
> startup. Both the delay and the potential for differing test results
> seem like pretty bad consequences.

Yeah, that doesn't sound great to me either. And I don't entirely see the point, at least not with what we know now.

I am a bit concerned that we'll find out there are popular platforms where clock_gettime compiles but fails with ENOSYS, or some similarly unhelpful behavior. But we won't find that out if we don't try.

regards, tom lane
* Tom Lane:

> On Linux (RHEL6, 2.4GHz x86_64), I find that gettimeofday(),
> clock_gettime(CLOCK_MONOTONIC), and clock_gettime(CLOCK_REALTIME)
> all take about 40ns. Of course gettimeofday() only has 1us resolution,
> but the other two have perhaps 10ns resolution (I get no duplicate
> readings in a tight loop). Other documented clockids include
> CLOCK_REALTIME_COARSE: about 10ns to read, but only 1ms resolution
> CLOCK_MONOTONIC_COARSE: about 12ns to read, but only 1ms resolution
> CLOCK_MONOTONIC_RAW: full resolution but very slow, ~145ns to read
> So CLOCK_MONOTONIC seems to be the thing to use here. It won't buy
> us anything speed-wise but the extra resolution will be nice.
> However, we need to do more research to see if this holds true on
> other popular distros.

Isn't this very specific to kernel and glibc versions, depending on things like CONFIG_HZ settings and what level of vDSO support has been backported?
Florian Weimer <fw@deneb.enyo.de> writes:
> * Tom Lane:
>> On Linux (RHEL6, 2.4GHz x86_64), I find that gettimeofday(),
>> clock_gettime(CLOCK_MONOTONIC), and clock_gettime(CLOCK_REALTIME)
>> all take about 40ns. Of course gettimeofday() only has 1us resolution,
>> but the other two have perhaps 10ns resolution (I get no duplicate
>> readings in a tight loop).

> Isn't this very specific to kernel and glibc versions, depending on
> things like CONFIG_HZ settings and what level of vDSO support has been
> backported?

No doubt, but I have yet to find a platform where clock_gettime() exists but performs worse than gettimeofday(). Do you know of one?

regards, tom lane
On Tue, Dec 27, 2016 at 10:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I also tried FreeBSD 11.0 on another Mac (2.3GHz x86_64),
> and found that gettimeofday as well as basically all their
> clock_gettime variants run in 27 to 28 ns; and clock_gettime
> reliably delivers full precision, except for CLOCK_SECOND which
> is intentionally truncated to 1s precision. So there would be
> no need to work with anything but CLOCK_MONOTONIC here.
>
> However, it seems that these impressive results date back only to
> June 2012, cf
> https://github.com/freebsd/freebsd/commit/13a9f42818f6b89a72b3e40923be809b490400d8
> and at least as of that commit, only x86 and x86_64 had the fast
> clock_gettime code. Older FreeBSD, or FreeBSD on another architecture,
> is likely to be a lot worse. But I lack an installation to try.

That commit is in every 'production' and 'legacy' release of FreeBSD[1], meaning as far back as 9.3 (expected to be EoL in the next few days), because it landed in 9.2 (EoL). ARM support landed in FreeBSD 11.0[2]. That leaves the following architectures without fast-path time functions:

macaque:freebsd munro$ git grep 'trivial-vdso_tc.c'
lib/libc/mips/sys/Makefile.inc:SRCS+=   trivial-vdso_tc.c
lib/libc/powerpc/Makefile.inc:SRCS+=    trivial-vdso_tc.c
lib/libc/powerpc64/Makefile.inc:SRCS+=  trivial-vdso_tc.c
lib/libc/powerpcspe/Makefile.inc:SRCS+= trivial-vdso_tc.c
lib/libc/riscv/sys/Makefile.inc:SRCS+=  trivial-vdso_tc.c
lib/libc/sparc64/Makefile.inc:SRCS+=    trivial-vdso_tc.c

[1] https://www.freebsd.org/releases/
[2] https://github.com/freebsd/freebsd/commit/80e8626b434515d16b3576174438526755336810

--
Thomas Munro
http://www.enterprisedb.com
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> On Tue, Dec 27, 2016 at 10:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> However, it seems that these impressive results date back only to
>> June 2012, cf
>> https://github.com/freebsd/freebsd/commit/13a9f42818f6b89a72b3e40923be809b490400d8
>> and at least as of that commit, only x86 and x86_64 had the fast
>> clock_gettime code. Older FreeBSD, or FreeBSD on another architecture,
>> is likely to be a lot worse. But I lack an installation to try.

> That commit is in every 'production' and 'legacy' release of
> FreeBSD[1], meaning as far back as 9.3 (expected to be EoL in the next
> few days), because it landed in 9.2 (EoL).

I'm unclear on whether there's any significant number of people running out-of-support *BSD releases. If it's not something we have to worry about, fine.

> That leaves the following architectures without
> fast-path time functions:
> macaque:freebsd munro$ git grep 'trivial-vdso_tc.c'
> lib/libc/mips/sys/Makefile.inc:SRCS+=   trivial-vdso_tc.c
> lib/libc/powerpc/Makefile.inc:SRCS+=    trivial-vdso_tc.c
> lib/libc/powerpc64/Makefile.inc:SRCS+=  trivial-vdso_tc.c
> lib/libc/powerpcspe/Makefile.inc:SRCS+= trivial-vdso_tc.c
> lib/libc/riscv/sys/Makefile.inc:SRCS+=  trivial-vdso_tc.c
> lib/libc/sparc64/Makefile.inc:SRCS+=    trivial-vdso_tc.c

Yeah, I just finished getting results from FreeBSD 10.3 on PPC (1.33GHz G4 laptop): gettimeofday takes about 1180 ns and clock_gettime about 1200 ns. That difference seems to be repeatable, but since it's only 2% I'm not too fussed about it.

Interestingly, it's very easy to tell that it is entering the kernel, because time(1) shows a significant fraction of system time:

$ time ./testclock
0 bogus readings
100000000 distinct readings
      117.96 real        26.80 user        90.31 sys

The same test on platforms with vDSO support shows zero system time.

regards, tom lane
Haribabu Kommi <kommi.haribabu@gmail.com> writes:
> Attached a patch that replaces most of the gettimeofday function calls,
> except timeofday (user callable) and the GetCurrentTimestamp functions.

I looked at this for awhile and could not convince myself that it's a good idea. Trying to do s/gettimeofday/clock_gettime/g is not going to do much for us except create portability headaches. According to my tests, clock_gettime is not noticeably faster than gettimeofday on any platform, except that if you use nonstandard clockids like CLOCK_REALTIME_COARSE then on *some* platforms it's a little bit quicker, at the cost of being a great deal less precise. But we'd have to research the existence and effects of nonstandard clockids on every platform. So AFAICS the only clear advantage to switching is the extra precision available from clock_gettime.

But ... most of the places you've touched in this patch have neither any need for sub-microsecond precision nor any great need to worry about shaving a few ns off the time taken by the call. As far as I can find, the only place where it's actually worth our trouble to deal with it is instr_time.h (ie, EXPLAIN ANALYZE and a few other uses).

So I think we should do something more like the attached.
One issue I did not resolve in this WIP patch is what to do with this gem of abstraction violation in pgbench:

			/* no, print raw transactions */
#ifndef WIN32

			/* This is more than we really ought to know about instr_time */
			if (skipped)
				fprintf(logfile, "%d " INT64_FORMAT " skipped %d %ld %ld",
						st->id, st->cnt, st->use_file,
						(long) now->tv_sec, (long) now->tv_usec);
			else
				fprintf(logfile, "%d " INT64_FORMAT " %.0f %d %ld %ld",
						st->id, st->cnt, latency, st->use_file,
						(long) now->tv_sec, (long) now->tv_usec);
#else

			/* On Windows, instr_time doesn't provide a timestamp anyway */
			if (skipped)
				fprintf(logfile, "%d " INT64_FORMAT " skipped %d 0 0",
						st->id, st->cnt, st->use_file);
			else
				fprintf(logfile, "%d " INT64_FORMAT " %.0f %d 0 0",
						st->id, st->cnt, latency, st->use_file);
#endif

We could either rip out the non-Windows code path entirely, or do something about providing an honest elapsed-time measurement, perhaps by doing INSTR_TIME_GET_DOUBLE() on the diff from run start to "now". Given that we're calling fprintf, I doubt that the extra arithmetic needed for that is a big problem.

regards, tom lane

diff --git a/configure b/configure
index 0f143a0..e5dd6fb 100755
*** a/configure
--- b/configure
*************** if test "$ac_res" != no; then :
*** 9055,9060 ****
--- 9055,9116 ----
  fi

+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for library containing clock_gettime" >&5
+ $as_echo_n "checking for library containing clock_gettime... " >&6; }
+ if ${ac_cv_search_clock_gettime+:} false; then :
+   $as_echo_n "(cached) " >&6
+ else
+   ac_func_search_save_LIBS=$LIBS
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+ /* end confdefs.h.  */
+
+ /* Override any GCC internal prototype to avoid an error.
+    Use char because int might match the return type of a GCC
+    builtin and then its argument prototype would still apply.  */
+ #ifdef __cplusplus
+ extern "C"
+ #endif
+ char clock_gettime ();
+ int
+ main ()
+ {
+ return clock_gettime ();
+   ;
+   return 0;
+ }
+ _ACEOF
+ for ac_lib in '' rt posix4; do
+   if test -z "$ac_lib"; then
+     ac_res="none required"
+   else
+     ac_res=-l$ac_lib
+     LIBS="-l$ac_lib $ac_func_search_save_LIBS"
+   fi
+   if ac_fn_c_try_link "$LINENO"; then :
+   ac_cv_search_clock_gettime=$ac_res
+ fi
+ rm -f core conftest.err conftest.$ac_objext \
+     conftest$ac_exeext
+   if ${ac_cv_search_clock_gettime+:} false; then :
+   break
+ fi
+ done
+ if ${ac_cv_search_clock_gettime+:} false; then :
+
+ else
+   ac_cv_search_clock_gettime=no
+ fi
+ rm conftest.$ac_ext
+ LIBS=$ac_func_search_save_LIBS
+ fi
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_search_clock_gettime" >&5
+ $as_echo "$ac_cv_search_clock_gettime" >&6; }
+ ac_res=$ac_cv_search_clock_gettime
+ if test "$ac_res" != no; then :
+   test "$ac_res" = "none required" || LIBS="$ac_res $LIBS"
+
+ fi
+
  # Solaris:
  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for library containing fdatasync" >&5
  $as_echo_n "checking for library containing fdatasync... " >&6; }
*************** fi
*** 12520,12526 ****
  LIBS_including_readline="$LIBS"
  LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`

! for ac_func in cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat pthread_is_threaded_np readlink setproctitle setsid shm_open symlink sync_file_range towlower utime utimes wcstombs wcstombs_l
  do :
    as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
    ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
--- 12576,12582 ----
  LIBS_including_readline="$LIBS"
  LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`

! for ac_func in cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat pthread_is_threaded_np readlink setproctitle setsid shm_open symlink sync_file_range towlower utime utimes wcstombs wcstombs_l
  do :
    as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
    ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.in b/configure.in
index b9831bc..77aa3b4 100644
*** a/configure.in
--- b/configure.in
*************** AC_SEARCH_LIBS(getopt_long, [getopt gnug
*** 1016,1021 ****
--- 1016,1022 ----
  AC_SEARCH_LIBS(crypt, crypt)
  AC_SEARCH_LIBS(shm_open, rt)
  AC_SEARCH_LIBS(shm_unlink, rt)
+ AC_SEARCH_LIBS(clock_gettime, [rt posix4])
  # Solaris:
  AC_SEARCH_LIBS(fdatasync, [rt posix4])
  # Required for thread_test.c on Solaris
*************** PGAC_FUNC_WCSTOMBS_L
*** 1415,1421 ****
  LIBS_including_readline="$LIBS"
  LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`

! AC_CHECK_FUNCS([cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat pthread_is_threaded_np readlink setproctitle setsid shm_open symlink sync_file_range towlower utime utimes wcstombs wcstombs_l])
  AC_REPLACE_FUNCS(fseeko)
  case $host_os in
--- 1416,1422 ----
  LIBS_including_readline="$LIBS"
  LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`

! AC_CHECK_FUNCS([cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat pthread_is_threaded_np readlink setproctitle setsid shm_open symlink sync_file_range towlower utime utimes wcstombs wcstombs_l])
  AC_REPLACE_FUNCS(fseeko)
  case $host_os in
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 42a3fc8..b9dfdd4 100644
*** a/src/include/pg_config.h.in
--- b/src/include/pg_config.h.in
***************
*** 105,110 ****
--- 105,113 ----
  /* Define to 1 if you have the `class' function. */
  #undef HAVE_CLASS

+ /* Define to 1 if you have the `clock_gettime' function. */
+ #undef HAVE_CLOCK_GETTIME
+
  /* Define to 1 if you have the <crtdefs.h> header file. */
  #undef HAVE_CRTDEFS_H
diff --git a/src/include/pg_config.h.win32 b/src/include/pg_config.h.win32
index ceb8b79..199668c 100644
*** a/src/include/pg_config.h.win32
--- b/src/include/pg_config.h.win32
***************
*** 75,80 ****
--- 75,83 ----
  /* Define to 1 if you have the `class' function. */
  /* #undef HAVE_CLASS */

+ /* Define to 1 if you have the `clock_gettime' function. */
+ /* #undef HAVE_CLOCK_GETTIME */
+
  /* Define to 1 if you have the `crypt' function. */
  /* #undef HAVE_CRYPT */
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index 16caf6e..b10d6f0 100644
*** a/src/include/portability/instr_time.h
--- b/src/include/portability/instr_time.h
***************
*** 4,13 ****
  * portable high-precision interval timing
  *
  * This file provides an abstraction layer to hide portability issues in
! * interval timing.  On Unix we use gettimeofday(), but on Windows that
! * gives a low-precision result so we must use QueryPerformanceCounter()
! * instead.  These macros also give some breathing room to use other
! * high-precision-timing APIs on yet other platforms.
  *
  * The basic data type is instr_time, which all callers should treat as an
  * opaque typedef.  instr_time can store either an absolute time (of
--- 4,13 ----
  * portable high-precision interval timing
  *
  * This file provides an abstraction layer to hide portability issues in
! * interval timing.  On Unix we use clock_gettime() if available, else
! * gettimeofday().  On Windows, gettimeofday() gives a low-precision result
! * so we must use QueryPerformanceCounter() instead.  These macros also give
! * some breathing room to use other high-precision-timing APIs.
  *
  * The basic data type is instr_time, which all callers should treat as an
  * opaque typedef.  instr_time can store either an absolute time (of
***************
*** 54,59 ****
--- 54,147 ----

  #ifndef WIN32

+ #ifdef HAVE_CLOCK_GETTIME
+
+ /* Use clock_gettime() */
+
+ #include <time.h>
+
+ /*
+  * The best clockid to use according to the POSIX spec is CLOCK_MONOTONIC,
+  * since that will give reliable interval timing even in the face of changes
+  * to the system clock.  However, POSIX doesn't require implementations to
+  * provide anything except CLOCK_REALTIME, so fall back to that if we don't
+  * find CLOCK_MONOTONIC.
+  *
+  * Also, some implementations have nonstandard clockids with better properties
+  * than CLOCK_MONOTONIC.  In particular, as of macOS 10.12, Apple provides
+  * CLOCK_MONOTONIC_RAW which is both faster to read and higher resolution than
+  * their version of CLOCK_MONOTONIC.
+  */
+ #if defined(__darwin__) && defined(CLOCK_MONOTONIC_RAW)
+ #define PG_INSTR_CLOCK	CLOCK_MONOTONIC_RAW
+ #elif defined(CLOCK_MONOTONIC)
+ #define PG_INSTR_CLOCK	CLOCK_MONOTONIC
+ #else
+ #define PG_INSTR_CLOCK	CLOCK_REALTIME
+ #endif
+
+ typedef struct timespec instr_time;
+
+ #define INSTR_TIME_IS_ZERO(t)	((t).tv_nsec == 0 && (t).tv_sec == 0)
+
+ #define INSTR_TIME_SET_ZERO(t)	((t).tv_sec = 0, (t).tv_nsec = 0)
+
+ #define INSTR_TIME_SET_CURRENT(t)	((void) clock_gettime(PG_INSTR_CLOCK, &(t)))
+
+ #define INSTR_TIME_ADD(x,y) \
+ 	do { \
+ 		(x).tv_sec += (y).tv_sec; \
+ 		(x).tv_nsec += (y).tv_nsec; \
+ 		/* Normalize */ \
+ 		while ((x).tv_nsec >= 1000000000) \
+ 		{ \
+ 			(x).tv_nsec -= 1000000000; \
+ 			(x).tv_sec++; \
+ 		} \
+ 	} while (0)
+
+ #define INSTR_TIME_SUBTRACT(x,y) \
+ 	do { \
+ 		(x).tv_sec -= (y).tv_sec; \
+ 		(x).tv_nsec -= (y).tv_nsec; \
+ 		/* Normalize */ \
+ 		while ((x).tv_nsec < 0) \
+ 		{ \
+ 			(x).tv_nsec += 1000000000; \
+ 			(x).tv_sec--; \
+ 		} \
+ 	} while (0)
+
+ #define INSTR_TIME_ACCUM_DIFF(x,y,z) \
+ 	do { \
+ 		(x).tv_sec += (y).tv_sec - (z).tv_sec; \
+ 		(x).tv_nsec += (y).tv_nsec - (z).tv_nsec; \
+ 		/* Normalize after each add to avoid overflow/underflow of tv_nsec */ \
+ 		while ((x).tv_nsec < 0) \
+ 		{ \
+ 			(x).tv_nsec += 1000000000; \
+ 			(x).tv_sec--; \
+ 		} \
+ 		while ((x).tv_nsec >= 1000000000) \
+ 		{ \
+ 			(x).tv_nsec -= 1000000000; \
+ 			(x).tv_sec++; \
+ 		} \
+ 	} while (0)
+
+ #define INSTR_TIME_GET_DOUBLE(t) \
+ 	(((double) (t).tv_sec) + ((double) (t).tv_nsec) / 1000000000.0)
+
+ #define INSTR_TIME_GET_MILLISEC(t) \
+ 	(((double) (t).tv_sec * 1000.0) + ((double) (t).tv_nsec) / 1000000.0)
+
+ #define INSTR_TIME_GET_MICROSEC(t) \
+ 	(((uint64) (t).tv_sec * (uint64) 1000000) + (uint64) ((t).tv_nsec / 1000))
+
+ #else							/* !HAVE_CLOCK_GETTIME */
+
+ /* Use gettimeofday() */
+
  #include <sys/time.h>

  typedef struct timeval instr_time;
*************** typedef struct timeval instr_time;
*** 113,120 ****
--- 201,213 ----
  #define INSTR_TIME_GET_MICROSEC(t) \
  	(((uint64) (t).tv_sec * (uint64) 1000000) + (uint64) (t).tv_usec)

+ #endif							/* HAVE_CLOCK_GETTIME */
+
  #else							/* WIN32 */

+ /* Use QueryPerformanceCounter() */
+
  typedef LARGE_INTEGER instr_time;

  #define INSTR_TIME_IS_ZERO(t)	((t).QuadPart == 0)
*************** GetTimerFrequency(void)
*** 149,154 ****
--- 242,248 ----
  	QueryPerformanceFrequency(&f);
  	return (double) f.QuadPart;
  }
+
  #endif							/* WIN32 */

  #endif							/* INSTR_TIME_H */

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Dec 30, 2016 at 1:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Haribabu Kommi <kommi.haribabu@gmail.com> writes:
> Attached a patch that replaces most of the gettimeofday function calls,
> except timeofday(user callable) and GetCurrentTimestamp functions.
I looked at this for awhile and could not convince myself that it's
a good idea. Trying to do s/gettimeofday/clock_gettime/g is not going
to do much for us except create portability headaches. According
to my tests, clock_gettime is not noticeably faster than gettimeofday
on any platform, except that if you use nonstandard clockids like
CLOCK_REALTIME_COARSE then on *some* platforms it's a little bit quicker,
at the cost of being a great deal less precise. But we'd have to research
the existence and effects of nonstandard clockids on every platform.
So AFAICS the only clear advantage to switching is the extra precision
available from clock_gettime.
But ... most of the places you've touched in this patch have neither any
need for sub-microsecond precision nor any great need to worry about
shaving a few ns off the time taken by the call. As far as I can find,
the only place where it's actually worth our trouble to deal with it is
instr_time.h (ie, EXPLAIN ANALYZE and a few other uses).
So I think we should do something more like the attached.
Thanks for your valuable input.
As the gettimeofday() function is obsolescent and any further enhancements are likely to happen only in clock_gettime(), I changed it in many places.

Yes, I agree that unless clock_gettime() performs faster than gettimeofday() on all platforms, we can retain the gettimeofday() function.
Regards,
Hari Babu
Fujitsu Australia
* Tom Lane:

> Florian Weimer <fw@deneb.enyo.de> writes:
>> * Tom Lane:
>>> On Linux (RHEL6, 2.4GHz x86_64), I find that gettimeofday(),
>>> clock_gettime(CLOCK_MONOTONIC), and clock_gettime(CLOCK_REALTIME)
>>> all take about 40ns. Of course gettimeofday() only has 1us resolution,
>>> but the other two have perhaps 10ns resolution (I get no duplicate
>>> readings in a tight loop).
>
>> Isn't this very specific to kernel and glibc versions, depending on
>> things like CONFIG_HZ settings and what level of vDSO support has been
>> backported?
>
> No doubt, but I have yet to find a platform where clock_gettime() exists
> but performs worse than gettimeofday(). Do you know of one?

ppc64le with all the vDSO fixes for clock_gettime?

glibc has some test cases which fail because clock_gettime gives inconsistent results. This has been fixed in current kernels, but I don't know if everyone uses them.
Florian Weimer <fw@deneb.enyo.de> writes:
> glibc has some test cases which fail because clock_gettime gives
> inconsistent results. This has been fixed in current kernels, but I
> don't know if everyone uses them.

Hmm, for which clock IDs?

regards, tom lane