Thread: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From

Andres Freund

Date:

12 June 2020, 23:28:10

Hi,

Currently using EXPLAIN (ANALYZE) without TIMING OFF regularly changes
the resulting timing enough that the times aren't meaningful. E.g.

CREATE TABLE lotsarows(key int not null);
INSERT INTO lotsarows SELECT generate_series(1, 50000000);
VACUUM FREEZE lotsarows;


-- best of three:
SELECT count(*) FROM lotsarows;
Time: 1923.394 ms (00:01.923)

-- best of three:
EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM lotsarows;
Time: 2319.830 ms (00:02.320)

-- best of three:
EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 4202.649 ms (00:04.203)

That nearly *double* the execution time without TIMING.


Looking at a profile of this shows that we spend a good bit of cycles
"normalizing" timstamps etc. That seems pretty unnecessary, just forced
on us due to struct timespec. So the first attached patch just turns
instr_time to be a 64bit integer, counting nanoseconds.

That helps, a tiny bit:
EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 4179.302 ms (00:04.179)

but obviously doesn't move the needle.


Looking at a profile it's easy to confirm that we spend a lot of time
acquiring time:
-   95.49%     0.00%  postgres     postgres                 [.] agg_retrieve_direct (inlined)
   - agg_retrieve_direct (inlined)
      - 79.27% fetch_input_tuple
         - ExecProcNode (inlined)
            - 75.72% ExecProcNodeInstr
               + 25.22% SeqNext
               - 21.74% InstrStopNode
                  + 17.80% __GI___clock_gettime (inlined)
               - 21.44% InstrStartNode
                  + 19.23% __GI___clock_gettime (inlined)
               + 4.06% ExecScan
      + 13.09% advance_aggregates (inlined)
        1.06% MemoryContextReset

And that's even though linux avoids a syscall (in most cases) etc to
acquire the time. Unless the kernel detects there's a reason not to do
so, linux does this by using 'rdtscp' and multiplying it by kernel
provided factors to turn the cycles into time.

Some of the time is spent doing function calls, dividing into struct
timespec, etc. But most of it just the rdtscp instruction:
 65.30 │1  63:   rdtscp


The reason for that is largely that rdtscp waits until all prior
instructions have finished (but it allows later instructions to already
start). Multiple times for each tuple.


In the second attached prototype patch I've change instr_time to count
in cpu cycles instead of nanoseconds. And then just turned the cycles
into seconds in INSTR_TIME_GET_DOUBLE() (more about that part later).

When using rdtsc that results in *vastly* lower overhead:

┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                          QUERY PLAN
       │
 

├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Aggregate  (cost=846239.20..846239.21 rows=1 width=8) (actual time=2610.235..2610.235 rows=1 loops=1)
       │
 
│   ->  Seq Scan on lotsarows  (cost=0.00..721239.16 rows=50000016 width=0) (actual time=0.006..1512.886 rows=50000000
loops=1)│
 
│ Planning Time: 0.028 ms
       │
 
│ Execution Time: 2610.256 ms
       │
 

└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(4 rows)

Time: 2610.589 ms (00:02.611)

And there's still some smaller improvements that could be made ontop of
that.

As a comparison, here's the time when using rdtscp directly in
instr_time, instead of going through clock_gettime:
Time: 3481.162 ms (00:03.481)

That shows pretty well how big the cost of the added pipeline stalls
are, and how important out-of-order execution is for decent
performance...


In my opinion, for the use in InstrStartNode(), InstrStopNode() etc, we
do *not* want to wait for prior instructions to finish, since that
actually leads to the timing being less accurate, rather than
more. There are other cases where that'd be different, e.g. measuring
how long an entire query takes or such (but there it's probably
irrelevant which to use).


I've above skipped a bit over the details of how to turn the cycles
returned by rdtsc into time:

On x86 CPUs of the last ~12 years rdtsc doesn't return the cycles that
have actually been run, but instead returns the number of 'reference
cycles'. That's important because otherwise things like turbo mode and
lower power modes would lead to completely bogus times.

Thus, knowing the "base frequency" of the CPU allows to turn the
difference between two rdtsc return values into seconds.

In the attached prototype I just determined the number of cycles using
cpuid(0x16). That's only available since Skylake (I think). On older
CPUs we'd have to look at /proc/cpuinfo or
/sys/devices/system/cpu/cpu0/cpufreq/base_frequency.


There's also other issues with using rdtsc directly: On older CPUs, in
particular older multi-socket systems, the tsc will not be synchronized
in detail across cores. There's bits that'd let us check whether tsc is
suitable or not.  The more current issue of that is that things like
virtual machines being migrated can lead to rdtsc suddenly returning a
different value / the frequency differening. But that is supposed to be
solved these days, by having virtualization technologies set frequency
multipliers and offsets which then cause rdtsc[p] to return something
meaningful, even after migration.


The attached patches are really just a prototype. I'm also not really
planning to work on getting this into a "production ready" patchset
anytime soon. I developed it primarily because I found it the overhead
made it too hard to nail down in which part of a query tree performance
changed.  If somebody else wants to continue from here...

I do think it's be a pretty significant improvement if we could reduce
the timing overhead of EXPLAIN ANALYZE by this much. Even if requires a
bunch of low-level code.

Greetings,

Andres Freund

Attachment

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From

Pavel Stehule

Date:

13 June 2020, 03:53:01

so 13. 6. 2020 v 1:28 odesílatel Andres Freund <andres@anarazel.de> napsal:

Hi,

Currently using EXPLAIN (ANALYZE) without TIMING OFF regularly changes
the resulting timing enough that the times aren't meaningful. E.g.

CREATE TABLE lotsarows(key int not null);
INSERT INTO lotsarows SELECT generate_series(1, 50000000);
VACUUM FREEZE lotsarows;

-- best of three:
SELECT count(*) FROM lotsarows;
Time: 1923.394 ms (00:01.923)

-- best of three:
EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM lotsarows;
Time: 2319.830 ms (00:02.320)

-- best of three:
EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 4202.649 ms (00:04.203)

That nearly *double* the execution time without TIMING.

Looking at a profile of this shows that we spend a good bit of cycles
"normalizing" timstamps etc. That seems pretty unnecessary, just forced
on us due to struct timespec. So the first attached patch just turns
instr_time to be a 64bit integer, counting nanoseconds.

That helps, a tiny bit:
EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 4179.302 ms (00:04.179)

but obviously doesn't move the needle.

Looking at a profile it's easy to confirm that we spend a lot of time
acquiring time:
- 95.49% 0.00% postgres postgres [.] agg_retrieve_direct (inlined)
- agg_retrieve_direct (inlined)
- 79.27% fetch_input_tuple
- ExecProcNode (inlined)
- 75.72% ExecProcNodeInstr
+ 25.22% SeqNext
- 21.74% InstrStopNode
+ 17.80% __GI___clock_gettime (inlined)
- 21.44% InstrStartNode
+ 19.23% __GI___clock_gettime (inlined)
+ 4.06% ExecScan
+ 13.09% advance_aggregates (inlined)
1.06% MemoryContextReset

And that's even though linux avoids a syscall (in most cases) etc to
acquire the time. Unless the kernel detects there's a reason not to do
so, linux does this by using 'rdtscp' and multiplying it by kernel
provided factors to turn the cycles into time.

Some of the time is spent doing function calls, dividing into struct
timespec, etc. But most of it just the rdtscp instruction:
65.30 │1 63: rdtscp

The reason for that is largely that rdtscp waits until all prior
instructions have finished (but it allows later instructions to already
start). Multiple times for each tuple.

In the second attached prototype patch I've change instr_time to count
in cpu cycles instead of nanoseconds. And then just turned the cycles
into seconds in INSTR_TIME_GET_DOUBLE() (more about that part later).

When using rdtsc that results in *vastly* lower overhead:
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Aggregate (cost=846239.20..846239.21 rows=1 width=8) (actual time=2610.235..2610.235 rows=1 loops=1) │
│ -> Seq Scan on lotsarows (cost=0.00..721239.16 rows=50000016 width=0) (actual time=0.006..1512.886 rows=50000000 loops=1) │
│ Planning Time: 0.028 ms │
│ Execution Time: 2610.256 ms │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(4 rows)

Time: 2610.589 ms (00:02.611)

And there's still some smaller improvements that could be made ontop of
that.

As a comparison, here's the time when using rdtscp directly in
instr_time, instead of going through clock_gettime:
Time: 3481.162 ms (00:03.481)

That shows pretty well how big the cost of the added pipeline stalls
are, and how important out-of-order execution is for decent
performance...

In my opinion, for the use in InstrStartNode(), InstrStopNode() etc, we
do *not* want to wait for prior instructions to finish, since that
actually leads to the timing being less accurate, rather than
more. There are other cases where that'd be different, e.g. measuring
how long an entire query takes or such (but there it's probably
irrelevant which to use).

I've above skipped a bit over the details of how to turn the cycles
returned by rdtsc into time:

On x86 CPUs of the last ~12 years rdtsc doesn't return the cycles that
have actually been run, but instead returns the number of 'reference
cycles'. That's important because otherwise things like turbo mode and
lower power modes would lead to completely bogus times.

Thus, knowing the "base frequency" of the CPU allows to turn the
difference between two rdtsc return values into seconds.

In the attached prototype I just determined the number of cycles using
cpuid(0x16). That's only available since Skylake (I think). On older
CPUs we'd have to look at /proc/cpuinfo or
/sys/devices/system/cpu/cpu0/cpufreq/base_frequency.

There's also other issues with using rdtsc directly: On older CPUs, in
particular older multi-socket systems, the tsc will not be synchronized
in detail across cores. There's bits that'd let us check whether tsc is
suitable or not. The more current issue of that is that things like
virtual machines being migrated can lead to rdtsc suddenly returning a
different value / the frequency differening. But that is supposed to be
solved these days, by having virtualization technologies set frequency
multipliers and offsets which then cause rdtsc[p] to return something
meaningful, even after migration.

The attached patches are really just a prototype. I'm also not really
planning to work on getting this into a "production ready" patchset
anytime soon. I developed it primarily because I found it the overhead
made it too hard to nail down in which part of a query tree performance
changed. If somebody else wants to continue from here...

I do think it's be a pretty significant improvement if we could reduce
the timing overhead of EXPLAIN ANALYZE by this much. Even if requires a
bunch of low-level code.

Pavel

Greetings,

Andres Freund

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From

Lukas Fittl

Date:

01 July 2022, 08:23:01

On Fri, Jun 12, 2020 at 4:28 PM Andres Freund <andres@anarazel.de> wrote:

The attached patches are really just a prototype. I'm also not really
planning to work on getting this into a "production ready" patchset
anytime soon. I developed it primarily because I found it the overhead
made it too hard to nail down in which part of a query tree performance
changed. If somebody else wants to continue from here...

I do think it's be a pretty significant improvement if we could reduce
the timing overhead of EXPLAIN ANALYZE by this much. Even if requires a
bunch of low-level code.

Date:

06 September 2022, 06:32:18

On Fri, Jul 15, 2022 at 11:22 PM Maciek Sakrejda <m.sakrejda@gmail.com> wrote:

I ran that original test case with and without the patch. Here are the
numbers I'm seeing:

master (best of three):

postgres=# SELECT count(*) FROM lotsarows;
Time: 582.423 ms

postgres=# EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM lotsarows;
Time: 616.102 ms

postgres=# EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 1068.700 ms (00:01.069)

patched (best of three):

postgres=# SELECT count(*) FROM lotsarows;
Time: 550.822 ms

postgres=# EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM lotsarows;
Time: 612.572 ms

postgres=# EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 690.875 ms

On Fri, Jul 1, 2022 at 10:26 AM Andres Freund <andres@anarazel.de> wrote:
> On 2022-07-01 01:23:01 -0700, Lukas Fittl wrote:
>...
> > Known WIP problems with this patch version:
> >
> > * There appears to be a timing discrepancy I haven't yet worked out, where
> > the \timing data reported by psql doesn't match what EXPLAIN ANALYZE is
> > reporting. With Andres' earlier test case, I'm seeing a consistent ~700ms
> > higher for \timing than for the EXPLAIN ANALYZE time reported on the
> > server
> > side, only when rdtsc measurement is used -- its likely there is a problem
> > somewhere with how we perform the cycles to time conversion
>
> Could you explain a bit more what you're seeing? I just tested your patches
> and didn't see that here.

I did not see this either, but I did see that the execution time
reported by \timing is (for this test case) consistently 0.5-1ms
*lower* than the Execution Time reported by EXPLAIN. I did not see
that on master. Is that expected?

Thanks,
Maciek

The patch requires a rebase; please rebase the patch with the latest code.

Hunk #5 succeeded at 147 with fuzz 2 (offset -3 lines).
Hunk #6 FAILED at 170.
Hunk #7 succeeded at 165 (offset -69 lines).
2 out of 7 hunks FAILED -- saving rejects to file src/include/portability/instr_time.h.rej
patching file src/tools/msvc/Mkvcbuild.pm

Ibrar Ahmed

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From

Michael Paquier

Date:

12 October 2022, 08:33:42

On Tue, Sep 06, 2022 at 11:32:18AM +0500, Ibrar Ahmed wrote:
> Hunk #5 succeeded at 147 with fuzz 2 (offset -3 lines).
> Hunk #6 FAILED at 170.
> Hunk #7 succeeded at 165 (offset -69 lines).
> 2 out of 7 hunks FAILED -- saving rejects to file
> src/include/portability/instr_time.h.rej
> patching file src/tools/msvc/Mkvcbuild.pm

No rebased version has been sent since this update, so this patch has
been marked as RwF.
--
Michael

Attachment

signature.asc

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From

David Geier

Date:

19 November 2022, 20:05:28

I think it would be great to get this patch committed. Beyond the 
reasons already mentioned, the significant overhead also tends to skew 
the reported runtimes in ways that makes it difficult to compare them. 
For example, if two nodes are executed equally often but one needs twice 
the time to process the rows: in such a case EXPLAIN ANALYZE should 
report timings that are 2x apart. However, currently, the high overhead 
of clock_gettime() tends to skew the relative runtimes.

On 10/12/22 10:33, Michael Paquier wrote:
> No rebased version has been sent since this update, so this patch has
> been marked as RwF.

I've rebased the patch set on latest master and fixed a few compiler 
warnings. Beyond that some findings and thoughts:

You're only using RDTSC if the clock source is 'tsc'. Great idea to not 
bother caring about a lot of hairy TSC details. Looking at the kernel 
code this seems to imply that the TSC is frequency invariant. I don't 
think though that this implies that Linux is not running under a 
hypervisor; which is good because I assume PostgreSQL is used a lot in 
VMs. However, when running under a hypervisor (at least with VMWare) 
CPUID leaf 0x16 is not available. In my tests __get_cpuid() indicated 
success but the returned values were garbage. Instead of using leaf 
0x16, we should then use the hypervisor interface to obtain the TSC 
frequency. Checking if a hypervisor is active can be done via:

bool IsHypervisorActive()
{
     uint32 cpuinfo[4] = {0};
     int res = __get_cpuid(0x1, &cpuinfo[0], &cpuinfo[1], &cpuinfo[2], 
&cpuinfo[3]);
     return res > 0 && (cpuinfo[2] & (1 << 30));
}

Obtaining the TSC frequency via the hypervisor interface can be done 
with the following code. See https://lwn.net/Articles/301888/ for more 
details.

// Under hypervisors (tested with VMWare) leaf 0x16 is not available, 
even though __get_cpuid() succeeds.
// Hence, if running under a hypervisor, use the hypervisor interface to 
obtain TSC frequency.
uint32 cpuinfo[4] = {0};
if (IsHypervisorActive() && __get_cpuid(0x40000001, &cpuinfo[0], 
&cpuinfo[1], &cpuinfo[2], &cpuinfo[3]) > 0)
     cycles_to_sec = 1.0 / ((double)cpuinfo[0] * 1000 * 1000);

Given that we anyways switch between RDTSC and clock_gettime() with a 
global variable, what about exposing the clock source as GUC? That way 
the user can switch back to a working clock source in case we miss a 
detail around activating or reading the TSC.

I'm happy to update the patches accordingly.

--
David Geier
(ServiceNow)

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From

David Geier

Date:

19 November 2022, 20:06:23

I missed attaching the patches.

-- 
David Geier
(ServiceNow)

Attachment

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From

David Geier

Date:

02 January 2023, 13:28:20

Hi,

I re-based again on master and applied the following changes:

I removed the fallback for obtaining the TSC frequency from /proc/cpu as 
suggested by Andres. Worst-case we fall back to clock_gettime().

I added code to obtain the TSC frequency via CPUID when under a 
hypervisor. I had to use __cpuid() directly instead of __get_cpuid(), 
because __get_cpuid() returns an error if the leaf is > 0x80000000 
(probably the implementation pre-dates the hypervisor timing leafs). 
Unfortunately, while testing my implementation under VMWare, I found 
that RDTSC runs awfully slow there (like 30x slower). [1] indicates that 
we cannot generally rely on RDTSC being actually fast on VMs. However, 
the same applies to clock_gettime(). It runs as slow as RDTSC on my 
VMWare setup. Hence, using RDTSC is not at disadvantage. I'm not 
entirely sure if there aren't cases where e.g. clock_gettime() is 
actually faster than RDTSC and it would be advantageous to use 
clock_gettime(). We could add a GUC so that the user can decide which 
clock source to use. Any thoughts?

I also somewhat improved the accuracy of the cycles to milli- and 
microseconds conversion functions by having two more multipliers with 
higher precision. For microseconds we could also keep the computation 
integer-only. I'm wondering what to best do for seconds and 
milliseconds. I'm currently leaning towards just keeping it as is, 
because the durations measured and converted are usually long enough 
that precision shouldn't be a problem.

In vacuum_lazy.c we do if ((INSTR_TIME_GET_MICROSEC(elapsed) / 1000). I 
changed that to use INSTR_TIME_GET_MILLISECS() instead. Additionally, I 
initialized a few variables of type instr_time which otherwise resulted 
in warnings due to use of potentially uninitialized variables.

I also couldn't reproduce the reported timing discrepancy. For me the 
runtime reported by \timing is just slightly higher than the time 
reported by EXPLAIN ANALYZE, which is expected.

Beyond that:

What about renaming INSTR_TIME_GET_DOUBLE() to INSTR_TIME_GET_SECS() so 
that it's consistent with the _MILLISEC() and _MICROSEC() variants?

The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other variants 
return double. This seems error prone. What about renaming the function 
or also have the function return a double and cast where necessary at 
the call site?

If no one objects I would also re-register this patch in the commit fest.

[1] 
https://vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/Timekeeping-In-VirtualMachines.pdf 
(page 11 "Virtual TSC")

-- 
David Geier
(ServiceNow)

Attachment

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From

Lukas Fittl

Date:

02 January 2023, 19:50:04

Hi David,

Thanks for continuing to work on this patch, and my apologies for silence on the patch.

Its been hard to make time, and especially so because I typically develop on an ARM-based macOS system where I can't test this directly - hence my tests with virtualized EC2 instances, where I ran into the timing oddities.

On Mon, Jan 2, 2023 at 5:28 AM David Geier <geidav.pg@gmail.com> wrote:

The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other variants
return double. This seems error prone. What about renaming the function
or also have the function return a double and cast where necessary at
the call site?

Minor note, but in my understanding using a uint64 (where we can) is faster for any simple arithmetic we do with the values.

If no one objects I would also re-register this patch in the commit fest.

+1, and feel free to carry this patch forward - I'll try to make an effort to review my earlier testing issues again, as well as your later improvements to the patch.

Also, FYI, I just posted an alternate idea for speeding up EXPLAIN ANALYZE with timing over in [0], using a sampling-based approach to reduce the timing overhead.

[0]: https://www.postgresql.org/message-id/CAP53PkxXMk0j-%2B0%3DYwRti2pFR5UB2Gu4v2Oyk8hhZS0DRART6g%40mail.gmail.com

Thanks,

Lukas

Lukas Fittl

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From

Maciek Sakrejda

Date:

02 January 2023, 20:44:42

On Fri, Jul 15, 2022 at 11:21 AM Maciek Sakrejda <m.sakrejda@gmail.com> wrote:
> On Fri, Jul 1, 2022 at 10:26 AM Andres Freund <andres@anarazel.de> wrote:
> > On 2022-07-01 01:23:01 -0700, Lukas Fittl wrote:
> >...
> > > Known WIP problems with this patch version:
> > >
> > > * There appears to be a timing discrepancy I haven't yet worked out, where
> > >   the \timing data reported by psql doesn't match what EXPLAIN ANALYZE is
> > >   reporting. With Andres' earlier test case, I'm seeing a consistent ~700ms
> > >   higher for \timing than for the EXPLAIN ANALYZE time reported on the
> > > server
> > >   side, only when rdtsc measurement is used -- its likely there is a problem
> > >   somewhere with how we perform the cycles to time conversion
> >
> > Could you explain a bit more what you're seeing? I just tested your patches
> > and didn't see that here.
>
> I did not see this either, but I did see that the execution time
> reported by \timing is (for this test case) consistently 0.5-1ms
> *lower* than the Execution Time reported by EXPLAIN. I did not see
> that on master. Is that expected?

For what it's worth, I can no longer reproduce this. In fact, I went
back to master-as-of-around-then and applied Lukas' v2 patches again,
and I still can't reproduce that. I do remember it happening
consistently across several executions, but now \timing consistently
shows 0.5-1ms slower, as expected. This does not explain the different
timing issue Lukas was seeing in his tests, but I think we can assume
what I reported originally here is not an issue.

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From

David Geier

Date:

03 January 2023, 08:38:20

Hi Lukas,

On 1/2/23 20:50, Lukas Fittl wrote:
> Thanks for continuing to work on this patch, and my apologies for 
> silence on the patch.

It would be great if you could review it.
Please also share your thoughts around exposing the used clock source as 
GUC and renaming INSTR_TIME_GET_DOUBLE() to _SECS().

I rebased again on master because of [1]. Patches attached.

>
> Its been hard to make time, and especially so because I typically 
> develop on an ARM-based macOS system where I can't test this directly 
> - hence my tests with virtualized EC2 instances, where I ran into the 
> timing oddities.
That's good and bad. Bad to do the development and good to test the 
implementation on more virtualized setups; given that I also encountered 
"interesting" behavior on VMWare (see my previous mails).
>
> On Mon, Jan 2, 2023 at 5:28 AM David Geier <geidav.pg@gmail.com> wrote:
>
>     The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other
>     variants
>     return double. This seems error prone. What about renaming the
>     function
>     or also have the function return a double and cast where necessary at
>     the call site?
>
>
> Minor note, but in my understanding using a uint64 (where we can) is 
> faster for any simple arithmetic we do with the values.

That's true. So the argument could be that for seconds and milliseconds 
we want the extra precision while microseconds are precise enough. 
Still, we could also make the seconds and milliseconds conversion code 
integer only and e.g. return two integers with the value before and 
after the comma. FWICS, the functions are nowhere used in performance 
critical code, so it doesn't really make a difference performance-wise.

>
> +1, and feel free to carry this patch forward - I'll try to make an 
> effort to review my earlier testing issues again, as well as your 
> later improvements to the patch.
Moved to the current commit fest. Will you become reviewer?
>
> Also, FYI, I just posted an alternate idea for speeding up EXPLAIN 
> ANALYZE with timing over in [0], using a sampling-based approach to 
> reduce the timing overhead.

Interesting idea. I'll reply with some thoughts on the corresponding thread.

[1] 
https://www.postgresql.org/message-id/flat/CALDaNm3kRBGPhndujr9JcjjbDCG3anhj0vW8b9YtbXrBDMSvvw%40mail.gmail.com

-- 
David Geier
(ServiceNow)

Attachment

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From

vignesh C

Date:

04 January 2023, 10:15:05

On Tue, 3 Jan 2023 at 14:08, David Geier <geidav.pg@gmail.com> wrote:
>
> Hi Lukas,
>
> On 1/2/23 20:50, Lukas Fittl wrote:
> > Thanks for continuing to work on this patch, and my apologies for
> > silence on the patch.
>
> It would be great if you could review it.
> Please also share your thoughts around exposing the used clock source as
> GUC and renaming INSTR_TIME_GET_DOUBLE() to _SECS().
>
> I rebased again on master because of [1]. Patches attached.
>
> >
> > Its been hard to make time, and especially so because I typically
> > develop on an ARM-based macOS system where I can't test this directly
> > - hence my tests with virtualized EC2 instances, where I ran into the
> > timing oddities.
> That's good and bad. Bad to do the development and good to test the
> implementation on more virtualized setups; given that I also encountered
> "interesting" behavior on VMWare (see my previous mails).
> >
> > On Mon, Jan 2, 2023 at 5:28 AM David Geier <geidav.pg@gmail.com> wrote:
> >
> >     The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other
> >     variants
> >     return double. This seems error prone. What about renaming the
> >     function
> >     or also have the function return a double and cast where necessary at
> >     the call site?
> >
> >
> > Minor note, but in my understanding using a uint64 (where we can) is
> > faster for any simple arithmetic we do with the values.
>
> That's true. So the argument could be that for seconds and milliseconds
> we want the extra precision while microseconds are precise enough.
> Still, we could also make the seconds and milliseconds conversion code
> integer only and e.g. return two integers with the value before and
> after the comma. FWICS, the functions are nowhere used in performance
> critical code, so it doesn't really make a difference performance-wise.
>
> >
> > +1, and feel free to carry this patch forward - I'll try to make an
> > effort to review my earlier testing issues again, as well as your
> > later improvements to the patch.
> Moved to the current commit fest. Will you become reviewer?
> >
> > Also, FYI, I just posted an alternate idea for speeding up EXPLAIN
> > ANALYZE with timing over in [0], using a sampling-based approach to
> > reduce the timing overhead.
>
> Interesting idea. I'll reply with some thoughts on the corresponding thread.
>
> [1]
> https://www.postgresql.org/message-id/flat/CALDaNm3kRBGPhndujr9JcjjbDCG3anhj0vW8b9YtbXrBDMSvvw%40mail.gmail.com

CFBot shows some compilation errors as in [1], please post an updated
version for the same:
09:08:12.525] /usr/bin/ld:
src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: warning:
relocation against `cycles_to_sec' in read-only section `.text'
[09:08:12.525] /usr/bin/ld:
src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: in
function `pg_clock_gettime_ref_cycles':
[09:08:12.525] /tmp/cirrus-ci-build/build/../src/include/portability/instr_time.h:119:
undefined reference to `use_rdtsc'
[09:08:12.525] /usr/bin/ld:
src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: in
function `test_timing':
[09:08:12.525] /tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:135:
undefined reference to `pg_clock_gettime_initialize_rdtsc'
[09:08:12.525] /usr/bin/ld:
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:137:
undefined reference to `cycles_to_us'
[09:08:12.525] /usr/bin/ld:
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:146:
undefined reference to `cycles_to_us'
[09:08:12.525] /usr/bin/ld:
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:169:
undefined reference to `cycles_to_us'
[09:08:12.525] /usr/bin/ld:
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:176:
undefined reference to `cycles_to_sec'
[09:08:12.525] /usr/bin/ld: warning: creating DT_TEXTREL in a PIE
[09:08:12.525] collect2: error: ld returned 1 exit status

[1] - https://cirrus-ci.com/task/5375312565895168

Regards,
Vignesh

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From

David Geier

Date:

04 January 2023, 12:02:05

Hi,

> CFBot shows some compilation errors as in [1], please post an updated
> version for the same:
> 09:08:12.525] /usr/bin/ld:
> src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: warning:
> relocation against `cycles_to_sec' in read-only section `.text'
> [09:08:12.525] /usr/bin/ld:
> src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: in
> function `pg_clock_gettime_ref_cycles':
> [09:08:12.525] /tmp/cirrus-ci-build/build/../src/include/portability/instr_time.h:119:
> undefined reference to `use_rdtsc'
> [09:08:12.525] /usr/bin/ld:
> src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: in
> function `test_timing':
> [09:08:12.525] /tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:135:
> undefined reference to `pg_clock_gettime_initialize_rdtsc'
> [09:08:12.525] /usr/bin/ld:
> /tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:137:
> undefined reference to `cycles_to_us'
> [09:08:12.525] /usr/bin/ld:
> /tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:146:
> undefined reference to `cycles_to_us'
> [09:08:12.525] /usr/bin/ld:
> /tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:169:
> undefined reference to `cycles_to_us'
> [09:08:12.525] /usr/bin/ld:
> /tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:176:
> undefined reference to `cycles_to_sec'
> [09:08:12.525] /usr/bin/ld: warning: creating DT_TEXTREL in a PIE
> [09:08:12.525] collect2: error: ld returned 1 exit status
>
> [1] - https://cirrus-ci.com/task/5375312565895168
>
> Regards,
> Vignesh

I fixed the compilation error on CFBot.
I missed adding instr_time.c to the Meson makefile.
New patch set attached.

-- 
David Geier
(ServiceNow)

po 16. 1. 2023 v 21:34 odesílatel Tomas Vondra <tomas.vondra@enterprisedb.com> napsal:

Hi,

there's minor bitrot in the Mkvcbuild.pm change, making cfbot unhappy.

As for the patch, I don't have much comments. I'm wondering if it'd be
useful to indicate which timing source was actually used for EXPLAIN
ANALYZE, say something like:

Planning time: 0.197 ms
Execution time: 0.225 ms
Timing source: clock_gettime (or tsc)

There has been a proposal to expose this as a GUC (or perhaps as explain
option), to allow users to pick what timing source to use. I wouldn't go
that far - AFAICS is this is meant to be universally better when
available. But knowing which source was used seems useful.

Pavel

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From

Robert Haas

Date:

17 January 2023, 13:46:12

On Fri, Jan 13, 2023 at 2:56 PM Andres Freund <andres@anarazel.de> wrote:
> Does anybody see a reason to not move forward with this aspect? We do a fair
> amount of INSTR_TIME_ACCUM_DIFF() etc, and that gets a good bit cheaper by
> just using nanoseconds. We'd also save memory in BufferUsage (144-122 bytes),
> Instrumentation (16 bytes saved in Instrumentation itself, 32 via
> BufferUsage).

I read through 0001 and it seems basically fine to me. Comments:

1. pg_clock_gettime_ns() doesn't follow pgindent conventions.

2. I'm not entirely sure that the new .?S_PER_.?S macros are
worthwhile but maybe they are, and in any case I don't care very much.

3. I've always found 'struct timespec' to be pretty annoying
notationally, so I like the fact that this patch would reduce use of
it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From

Andres Freund

Date:

17 January 2023, 16:47:58

Hi,

On 2023-01-17 08:46:12 -0500, Robert Haas wrote:
> On Fri, Jan 13, 2023 at 2:56 PM Andres Freund <andres@anarazel.de> wrote:
> > Does anybody see a reason to not move forward with this aspect? We do a fair
> > amount of INSTR_TIME_ACCUM_DIFF() etc, and that gets a good bit cheaper by
> > just using nanoseconds. We'd also save memory in BufferUsage (144-122 bytes),
> > Instrumentation (16 bytes saved in Instrumentation itself, 32 via
> > BufferUsage).

Here's an updated version of the move to representing instr_time as
nanoseconds. It's now split into a few patches:

0001) Add INSTR_TIME_SET_ZERO() calls where otherwise 0002 causes gcc to
      warn

      Alternatively we can decide to deprecate INSTR_TIME_SET_ZERO() and
      just allow to assign 0.

0002) Convert instr_time to uint64

      This is the cleaned up version of the prior patch. The main change is
      that it deduplicated a lot of the code between the architectures.

0003) Add INSTR_TIME_SET_SECOND()

      This is used in 0004. Just allows setting an instr_time to a time in
      seconds, allowing for a cheaper loop exit condition in 0004.

0004) report nanoseconds in pg_test_timing

I also couldn't help and hacked a bit on the rdtsc pieces. I did figure out
how to do the cycles->nanosecond conversion with integer shift and multiply in
the common case, which does show a noticable speedup. But that's for another
day.

I fought a bit with myself about whether to send those patches in this thread,
because it'll take over the CF entry. But decided that it's ok, given that
David's patches should be rebased over these anyway?

> I read through 0001 and it seems basically fine to me. Comments:
>
> 1. pg_clock_gettime_ns() doesn't follow pgindent conventions.

Fixed.

> 2. I'm not entirely sure that the new .?S_PER_.?S macros are
> worthwhile but maybe they are, and in any case I don't care very much.

There's now fewer. But those I'd like to keep. I just end up counting digits
manually way too many times.

> 3. I've always found 'struct timespec' to be pretty annoying
> notationally, so I like the fact that this patch would reduce use of
> it.

Same.

Greetings,

Andres Freund

Hi,

On 2023-01-17 10:50:53 -0800, Andres Freund wrote:
> On 2023-01-17 12:26:57 -0500, Tom Lane wrote:
> > > 0001) Add INSTR_TIME_SET_ZERO() calls where otherwise 0002 causes gcc to
> > >       warn
> > >       Alternatively we can decide to deprecate INSTR_TIME_SET_ZERO() and
> > >       just allow to assign 0.
> >
> > I think it's probably wise to keep the macro.  If we ever rethink this
> > again, we'll be glad we kept it.  Similarly, IS_ZERO is a good idea
> > even if it would work with just compare-to-zero.
>
> Perhaps an INSTR_TIME_ZERO() that could be assigned in variable definitions
> could give us the best of both worlds?

I tried that in the attached 0005. I found that it reads better if I also add
INSTR_TIME_CURRENT(). If we decide to go for this, I'd roll it into 0001
instead, but I wanted to get agreement on it first.

Comments?

> > I'm almost tempted to suggest you define instr_time as a struct with a
> > uint64 field, just to help keep us honest about that.
>
> I can see that making sense. Unless somebody pipes up with opposition to that
> plan soon, I'll see how it goes.

Done in the attached. I think it looks good. Actually found a type confusion
buglet in 0004, so the type safety benefit is noticable.

It does require a new INSTR_TIME_IS_LT() for the loop exit condition in 0004,
but that seems fine.

Besides cosmetic stuff I also added back the cast to double in window's
INSTR_TIME_GET_NANOSEC() - I think there's an overflow danger without it.

We should make this faster by pre-computing
  (double) NS_PER_S / GetTimerFrequency()
once, as that'd avoid doing the the slow division on every conversion. But
that's an old issue and thus better tackled separately.

Greetings,

Andres Freund

Hi,

On 2023-01-20 20:16:13 -0800, Andres Freund wrote:
> On 2023-01-18 14:05:35 +0100, David Geier wrote:
> > @Andres: will you take care of these changes and provide me with an updated
> > patch set so I can rebase the RDTSC changes?
> > Otherwise, I can also apply Tom suggestions to your patch set and send out
> > the complete patch set.
> 
> I'm planning to push most of my changes soon, had hoped to get to it a bit
> sooner, but ...

I pushed the int64-ification commits.

> If you have time to look at the pg_test_timing part, it'd be
> appreciated. That's a it larger, and nobody looked at it yet. So I'm a bit
> hesitant to push it.

I haven't yet pushed the pg_test_timing (nor it's small prerequisite)
patch.

Thanks to Justin I've polished the pg_test_timing docs some.

I've attached those two patches. Feel free to include them in your series if
you want, then the CF entry (and thus cfbot) makes sense again...

Greetings,

Andres Freund

Hi,

On 1/23/23 18:41, Andres Freund wrote:
> If we add it, it probably shouldn't depend on TIMING, but on
> SUMMARY. Regression test queries showing EXPLAIN ANALYZE output all do
> something like
>    EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
>
> the SUMMARY OFF gets rid of the "top-level" "Planning Time" and "Execution
> Time", whereas the TIMING OFF gets rid of the per-node timing. Those are
> separate options because per-node timing is problematic performance-wise
> (right now), but whole-query timing rarely is.
Makes sense. I wasn't aware of SUMMARY. Let's keep this option in mind, 
in case we'll revisit exposing the clock source in the future.
> Another, independent, thing worth thinking about: I think we might want to
> expose both rdtsc and rdtscp. For something like
> InstrStartNode()/InstrStopNode(), avoiding the "one-way barrier" of rdtscp is
> quite important to avoid changing the query performance. But for measuring
> whole-query time, we likely want to measure the actual time.
>
> It probably won't matter hugely for the whole query time - the out of order
> window of modern CPUs is large, but not *that* large - but I don't think we
> can generally assume that.

That's what I thought as well. I added INSTR_TIME_SET_CURRENT_FAST() and 
for now call that variant from InstrStartNode(), InstrEndNode() and 
pg_test_timing. To do so in InstrEndNode(), I removed 
INSTR_TIME_SET_CURRENT_LAZY(). Otherwise, two variants of that macro 
would be needed. INSTR_TIME_SET_CURRENT_LAZY() was only used in a single 
place and the code is more readable that way. INSTR_TIME_SET_CURRENT() 
is called from a bunch of places. I still have to go through all of them 
and see which should be changed to call the _FAST() variant.

Attached is v7 of the patch:

- Rebased on latest master (most importantly on top of the int64 
instr_time commits). - Includes two commits from Andres which introduce 
INSTR_TIME_SET_SECONDS(), INSTR_TIME_IS_LT() and WIP to report 
pg_test_timing output in nanoseconds. - Converts ticks to nanoseconds 
only with integer math, while accounting for overflow. - Supports RDTSCP 
via INSTR_TIME_SET_CURRENT() and introduced 
INSTR_TIME_SET_CURRENT_FAST() which uses RDTSC.

I haven't gotten to the following:

- Looking through all calls to INSTR_TIME_SET_CURRENT() and check if 
they should be replaced by INSTR_TIME_SET_CURRENT_FAST(). - Reviewing 
Andres commits. Potentially improving on pg_test_timing's output. - 
Looking at enabling RDTSC on more platforms. Is there a minimum set of 
platforms we would like support for? Windows should be easy. That would 
also allow to unify the code a little more. - Add more documentation and 
do more testing around the calls to CPUID. - Profiling and optimizing 
the code. A quick test showed about 10% improvement over master with 
TIMING ON vs TIMING OFF, when using the test-case from Andres' e-mail 
that started this thread.

I hope I'll find time to work on these points during the next days.

-- 
David Geier
(ServiceNow)

Hi,

On 2/7/23 19:12, Andres Freund wrote:
> This fails to build on several platforms:
>
> https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F42%2F3751

I think I fixed the compilation errors. It was due to a few variables 
being declared under

#if defined(__x86_64__) && defined(__linux__)

while being used also under non x86 Linux.

I also removed again the code to obtain the TSC frequency under 
hypervisors because the TSC is usually emulated and therefore no faster 
than clock_gettime() anyways. So we now simply fallback to 
clock_gettime() on hypervisors when we cannot obtain the frequency via 
leaf 0x16.

Beyond that I reviewed the first two patches a while ago in [1]. I hope 
we can progress with them to further reduce the size of this patch set.

[1] 
https://www.postgresql.org/message-id/3ac157f7-085d-e071-45fc-b87cd306360c%40gmail.com 

-- 
David Geier
(ServiceNow)

Attachment

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From

David Geier

Date:

14 February 2023, 12:48:56

Hi!

On 2/14/23 12:11, David Geier wrote:
> Hi,
>
> I think I fixed the compilation errors. It was due to a few variables 
> being declared under
>
> #if defined(__x86_64__) && defined(__linux__)
>
> while being used also under non x86 Linux.
>
> I also removed again the code to obtain the TSC frequency under 
> hypervisors because the TSC is usually emulated and therefore no 
> faster than clock_gettime() anyways. So we now simply fallback to 
> clock_gettime() on hypervisors when we cannot obtain the frequency via 
> leaf 0x16.
>
> Beyond that I reviewed the first two patches a while ago in [1]. I 
> hope we can progress with them to further reduce the size of this 
> patch set.
>
> [1] 
> https://www.postgresql.org/message-id/3ac157f7-085d-e071-45fc-b87cd306360c%40gmail.com 
>
>
It still fails.

I'll get Cirrus-CI working on my own Github fork so I can make sure it 
really compiles on all platforms before I submit a new version.

-- 
David Geier
(ServiceNow)

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From

David Geier

Date:

20 February 2023, 10:36:32

Hi!

On 2/14/23 13:48, David Geier wrote:
>
> It still fails.
>
> I'll get Cirrus-CI working on my own Github fork so I can make sure it 
> really compiles on all platforms before I submit a new version.

It took some time until Cirrus CI allowed me to run tests against my new 
GitHub account (there's a 3 days freeze to avoid people from getting 
Cirrus CI nodes to mine bitcoins :-D). Attached now the latest patch 
which passes builds, rebased on latest master.

I also reviewed the first two patches a while ago in [1]. I hope we can 
progress with them to further reduce the size of this patch set.

Beyond that: I could work on support for more OSs (e.g. starting with 
Windows). Is there appetite for that or do we rather want to instead 
start with a smaller patch?

[1] 
https://www.postgresql.org/message-id/3ac157f7-085d-e071-45fc-b87cd306360c%40gmail.com

-- 
David Geier
(ServiceNow)

On Sun, Jun 2, 2024 at 1:08 AM Andres Freund <andres@anarazel.de> wrote:

At some point this patch switched from rdtsc to rdtscp, which imo largely
negates the point of it. What lead to that?

From what I can gather, it appears this was an oversight when David first reapplied the work on the instr_time changes that were committed.

I've come back to this and rebased this, as well as:

- Corrected the use of RDTSCP to RDTSC in pg_get_ticks_fast

- Check 16H register if 15H register does not contain frequency information (per research, relevant for some CPUs)

- Fixed incorrect reporting in pg_test_timing due to too small histogram (32 => 64 bits)

- Fixed indentation per pgindent

- Added support for VMs running under KVM/VMware Hypervisors

On that last item, this does indeed make a difference on VMs, contrary to the code comment in earlier versions (and I've not seen any odd behaviors again, FWIW):

On a c5.xlarge (Skylake-SP or Cascade Lake) on AWS, with the same test as done initially in this thread:

SELECT COUNT(*) FROM lotsarows;
Time: 974.423 ms

EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM lotsarows;
Time: 1336.196 ms (00:01.336)

Without patch:

EXPLAIN (ANALYZE) SELECT COUNT(*) FROM lotsarows;
Time: 2165.069 ms (00:02.165)

Per loop time including overhead: 22.15 ns

With patch:

EXPLAIN (ANALYZE, TIMING ON) SELECT COUNT(*) FROM lotsarows;

Time: 1654.289 ms (00:01.654)

Per loop time including overhead: 9.81 ns

I'm registering this again in the current commitfest to help reviews.

Open questions I have:

- Could we rely on checking whether the TSC timesource is invariant (via CPUID), instead of relying on Linux choosing it as a clocksource?

- For the Hypervisor CPUID checks I had to rely on __cpuidex which is only available on newer GCC versions (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95973), how do we best check for its presence? (compiler version, or rather configure check?) -- note this is also the reason the patch fails the clang compiler warning check in CI, despite clang having support in recent versions (https://reviews.llvm.org/D121653)

Thanks,

Lukas

Lukas Fittl