Thread: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
Andres Freund
Date:
Hi,

Currently using EXPLAIN (ANALYZE) without TIMING OFF regularly changes
the resulting timing enough that the times aren't meaningful. E.g.

CREATE TABLE lotsarows(key int not null);
INSERT INTO lotsarows SELECT generate_series(1, 50000000);
VACUUM FREEZE lotsarows;


-- best of three:
SELECT count(*) FROM lotsarows;
Time: 1923.394 ms (00:01.923)

-- best of three:
EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM lotsarows;
Time: 2319.830 ms (00:02.320)

-- best of three:
EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 4202.649 ms (00:04.203)

That's nearly *double* the execution time compared to TIMING OFF.


Looking at a profile of this shows that we spend a good bit of cycles
"normalizing" timestamps etc. That seems pretty unnecessary, just forced
on us due to struct timespec. So the first attached patch just turns
instr_time into a 64-bit integer counting nanoseconds.
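To sketch the idea (this is not the exact patch, just roughly what the
representation and the hot macros boil down to, reusing the NS_PER_S /
INT64CONST conventions from instr_time.h and assuming the usual
clock_gettime(CLOCK_MONOTONIC) source):

typedef int64 instr_time;       /* nanoseconds */

#define NS_PER_S INT64CONST(1000000000)

#define INSTR_TIME_SET_CURRENT(t) \
    do { \
        struct timespec ts_; \
        clock_gettime(CLOCK_MONOTONIC, &ts_); \
        (t) = (instr_time) ts_.tv_sec * NS_PER_S + ts_.tv_nsec; \
    } while (0)

/* plain integer arithmetic, no tv_nsec carry/normalization needed */
#define INSTR_TIME_ACCUM_DIFF(x, y, z)  ((x) += (y) - (z))
#define INSTR_TIME_GET_DOUBLE(t)        ((double) (t) / NS_PER_S)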

That helps, a tiny bit:
EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 4179.302 ms (00:04.179)

but obviously doesn't move the needle.


Looking at a profile it's easy to confirm that we spend a lot of time
acquiring time:
-   95.49%     0.00%  postgres     postgres                 [.] agg_retrieve_direct (inlined)
   - agg_retrieve_direct (inlined)
      - 79.27% fetch_input_tuple
         - ExecProcNode (inlined)
            - 75.72% ExecProcNodeInstr
               + 25.22% SeqNext
               - 21.74% InstrStopNode
                  + 17.80% __GI___clock_gettime (inlined)
               - 21.44% InstrStartNode
                  + 19.23% __GI___clock_gettime (inlined)
               + 4.06% ExecScan
      + 13.09% advance_aggregates (inlined)
        1.06% MemoryContextReset

And that's even though linux avoids a syscall (in most cases) etc to
acquire the time. Unless the kernel detects there's a reason not to do
so, linux does this by using 'rdtscp' and multiplying it by kernel
provided factors to turn the cycles into time.

Some of the time is spent doing function calls, dividing into struct
timespec, etc. But most of it is just the rdtscp instruction:
 65.30 │1  63:   rdtscp


The reason for that is largely that rdtscp waits until all prior
instructions have finished (though it allows later instructions to start
already). And we do that multiple times for each tuple.


In the second attached prototype patch I've changed instr_time to count
in CPU cycles instead of nanoseconds, and then just turned the cycles
into seconds in INSTR_TIME_GET_DOUBLE() (more about that part later).

When using rdtsc that results in *vastly* lower overhead:

┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                          QUERY PLAN                                                           │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Aggregate  (cost=846239.20..846239.21 rows=1 width=8) (actual time=2610.235..2610.235 rows=1 loops=1)                         │
│   ->  Seq Scan on lotsarows  (cost=0.00..721239.16 rows=50000016 width=0) (actual time=0.006..1512.886 rows=50000000 loops=1) │
│ Planning Time: 0.028 ms                                                                                                       │
│ Execution Time: 2610.256 ms                                                                                                   │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(4 rows)

Time: 2610.589 ms (00:02.611)

And there are still some smaller improvements that could be made on top
of that.

As a comparison, here's the time when using rdtscp directly in
instr_time, instead of going through clock_gettime:
Time: 3481.162 ms (00:03.481)

That shows pretty well how big the cost of the added pipeline stalls is,
and how important out-of-order execution is for decent
performance...


In my opinion, for the use in InstrStartNode(), InstrStopNode() etc, we
do *not* want to wait for prior instructions to finish, since that
actually leads to the timing being less accurate, rather than
more. There are other cases where that'd be different, e.g. measuring
how long an entire query takes or such (but there it's probably
irrelevant which to use).
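To illustrate the difference (a rough sketch using the GCC/Clang
intrinsics on x86-64, not the form the patch actually uses):

#include <stdint.h>
#include <x86intrin.h>

/* Non-serializing read: can be reordered with surrounding instructions,
 * which is fine (and cheap) for per-node instrumentation. */
static inline uint64_t
read_tsc(void)
{
    return __rdtsc();
}

/* Partially serializing read: waits for all prior instructions to retire
 * before sampling the TSC; later instructions may still start early. */
static inline uint64_t
read_tsc_ordered(void)
{
    unsigned int aux;

    return __rdtscp(&aux);
}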


I've above skipped a bit over the details of how to turn the cycles
returned by rdtsc into time:

On x86 CPUs of the last ~12 years rdtsc doesn't return the cycles that
have actually been run, but instead returns the number of 'reference
cycles'. That's important because otherwise things like turbo mode and
lower power modes would lead to completely bogus times.

Thus, knowing the "base frequency" of the CPU allows turning the
difference between two rdtsc return values into seconds.

In the attached prototype I just determined the TSC frequency using
cpuid(0x16). That's only available since Skylake (I think). On older
CPUs we'd have to look at /proc/cpuinfo or
/sys/devices/system/cpu/cpu0/cpufreq/base_frequency.
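Roughly like this (a sketch, not the patch itself -- CPUID leaf 0x16
reports the processor base frequency in MHz in EAX and leaves it at 0
when the leaf isn't supported):

#include <cpuid.h>

/* Returns the factor to multiply a TSC delta with to get seconds,
 * or 0.0 if the base frequency can't be determined this way. */
static double
cycles_to_sec_factor(void)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    if (__get_cpuid(0x16, &eax, &ebx, &ecx, &edx) && eax != 0)
        return 1.0 / ((double) eax * 1000.0 * 1000.0);  /* MHz -> Hz */

    return 0.0;     /* caller falls back to /proc/cpuinfo or clock_gettime() */
}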


There are also other issues with using rdtsc directly: On older CPUs, in
particular older multi-socket systems, the tsc will not be synchronized
in detail across cores. There are bits that'd let us check whether the
tsc is suitable or not.  The more current issue is that things like
virtual machines being migrated can lead to rdtsc suddenly returning a
different value / the frequency differing. But that is supposed to be
solved these days, by having virtualization technologies set frequency
multipliers and offsets which then cause rdtsc[p] to return something
meaningful, even after migration.
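One of those bits, for reference (a sketch, assuming GCC/Clang): CPUID
leaf 0x80000007 advertises an invariant TSC in EDX bit 8, i.e. a TSC
that ticks at a constant rate regardless of turbo and power states:

#include <cpuid.h>
#include <stdbool.h>

static bool
has_invariant_tsc(void)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    return __get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx) != 0 &&
        (edx & (1u << 8)) != 0;
}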


The attached patches are really just a prototype. I'm also not really
planning to work on getting this into a "production ready" patchset
anytime soon. I developed it primarily because I found the overhead
made it too hard to nail down in which part of a query tree performance
changed.  If somebody else wants to continue from here...

I do think it'd be a pretty significant improvement if we could reduce
the timing overhead of EXPLAIN ANALYZE by this much. Even if it requires
a bunch of low-level code.

Greetings,

Andres Freund

Attachment

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
Pavel Stehule
Date:


On Sat, Jun 13, 2020 at 1:28 AM Andres Freund <andres@anarazel.de> wrote:

> I do think it'd be a pretty significant improvement if we could reduce
> the timing overhead of EXPLAIN ANALYZE by this much. Even if it requires
> a bunch of low-level code.

+1

Pavel


Greetings,

Andres Freund

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
Lukas Fittl
Date:
On Fri, Jun 12, 2020 at 4:28 PM Andres Freund <andres@anarazel.de> wrote:
The attached patches are really just a prototype. I'm also not really
planning to work on getting this into a "production ready" patchset
anytime soon. I developed it primarily because I found it the overhead
made it too hard to nail down in which part of a query tree performance
changed.  If somebody else wants to continue from here...

I do think it'd be a pretty significant improvement if we could reduce
the timing overhead of EXPLAIN ANALYZE by this much. Even if it requires
a bunch of low-level code.

Based on an off-list conversation with Andres, I decided to dust off this old
patch for using rdtsc directly. The significant EXPLAIN ANALYZE performance
improvements (especially when using rdtsc instead of rdtsc*p*) seem to warrant
giving this a more thorough look.

See attached an updated patch (adding it to the July commitfest), with a few
changes:

- Keep using clock_gettime() as a fallback if we decide to not use rdtsc
- Fallback to /proc/cpuinfo for clock frequency, if cpuid(0x16) doesn't work
- The decision to use rdtsc (or not) is made at runtime, in the new
  INSTR_TIME_INITIALIZE() -- we can't make this decision at compile time
  because this is dependent on the specific CPU in use, amongst other things
- In an abundance of caution, for now I've decided to only enable this if we
  are on Linux/x86, and the current kernel clocksource is TSC (the kernel has
  quite sophisticated logic around making this decision, see [1])

Note that if we implemented the decision logic ourselves (instead of relying
on the Linux kernel), I'd be most worried about older virtualization
technology. In my understanding getting this right is notably more complicated
than just checking cpuid, see [2].
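For reference, the clocksource check mentioned in the list above can be
as simple as reading sysfs; a rough sketch (the path is the standard
Linux location, the actual patch may structure this differently):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

static bool
kernel_clocksource_is_tsc(void)
{
    FILE   *f;
    char    buf[128];
    bool    result = false;

    f = fopen("/sys/devices/system/clocksource/clocksource0/current_clocksource", "r");
    if (f == NULL)
        return false;

    if (fgets(buf, sizeof(buf), f) != NULL)
        result = (strncmp(buf, "tsc", 3) == 0);

    fclose(f);
    return result;
}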

Known WIP problems with this patch version:

* There appears to be a timing discrepancy I haven't yet worked out, where
  the \timing data reported by psql doesn't match what EXPLAIN ANALYZE is
  reporting. With Andres' earlier test case, I'm consistently seeing \timing
  report ~700ms more than the EXPLAIN ANALYZE time reported on the server
  side, but only when rdtsc measurement is used -- it's likely there is a
  problem somewhere with how we perform the cycles-to-time conversion
* Possibly related, the floating point handling for the cycles_to_sec variable
  is problematic in terms of precision (see FIXME, taken over from Andres' POC)

Open questions from me:

1) Do we need to account for different TSC offsets on different CPUs in SMP
   systems? (the Linux kernel certainly has logic to that extent, but [3]
   suggests this is no longer a problem on Nehalem and newer chips, i.e. those
   having an invariant TSC)

2) Should we have a setting "--with-tsc" for configure? (instead of always
   enabling it when on Linux/x86 with a TSC clocksource)

3) Are there cases where we actually want to use rdtsc*p*? (i.e. wait for
   current instructions to finish -- the prior discussion seemed to suggest
   we don't want it for node instruction measurements, but possibly we do want
   this in other cases?)

4) Should we support using the "mrs" instruction on ARM? (which is similar to
   rdtsc, see [4])

Thanks,
Lukas

[1] https://github.com/torvalds/linux/blob/master/arch/x86/kernel/tsc.c
[2] http://oliveryang.net/2015/09/pitfalls-of-TSC-usage/
[3] https://stackoverflow.com/a/11060619/1652607
[4] https://cpufun.substack.com/p/fun-with-timers-and-cpuid

--
Lukas Fittl
Attachment

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
Andres Freund
Date:
Hi,

On 2022-07-01 01:23:01 -0700, Lukas Fittl wrote:
> On Fri, Jun 12, 2020 at 4:28 PM Andres Freund <andres@anarazel.de> wrote:
> 
> > The attached patches are really just a prototype. I'm also not really
> > planning to work on getting this into a "production ready" patchset
> > anytime soon. I developed it primarily because I found it the overhead
> > made it too hard to nail down in which part of a query tree performance
> > changed.  If somebody else wants to continue from here...
> >
> > I do think it's be a pretty significant improvement if we could reduce
> > the timing overhead of EXPLAIN ANALYZE by this much. Even if requires a
> > bunch of low-level code.
> >
> 
> Based on an off-list conversation with Andres, I decided to dust off this
> old
> patch for using rdtsc directly. The significant EXPLAIN ANALYZE performance
> improvements (especially when using rdtsc instead of rdtsc*p*) seem to
> warrant
> giving this a more thorough look.
> 
> See attached an updated patch (adding it to the July commitfest), with a few
> changes:
> 
> - Keep using clock_gettime() as a fallback if we decide to not use rdtsc

Yep.


> - Fallback to /proc/cpuinfo for clock frequency, if cpuid(0x16) doesn't work

I suspect that this might not be needed anymore. Seems like it'd be ok to just
fall back to clock_gettime() in that case.


> - In an abundance of caution, for now I've decided to only enable this if we
>   are on Linux/x86, and the current kernel clocksource is TSC (the kernel
> has
>   quite sophisticated logic around making this decision, see [1])

I think our requirements are a bit lower than the kernel's - we're not
tracking wall clock over an extended period...


> Note that if we implemented the decision logic ourselves (instead of relying
> on the Linux kernel), I'd be most worried about older virtualization
> technology. In my understanding getting this right is notably more
> complicated
> than just checking cpuid, see [2].


> Known WIP problems with this patch version:
> 
> * There appears to be a timing discrepancy I haven't yet worked out, where
>   the \timing data reported by psql doesn't match what EXPLAIN ANALYZE is
>   reporting. With Andres' earlier test case, I'm seeing a consistent ~700ms
>   higher for \timing than for the EXPLAIN ANALYZE time reported on the
> server
>   side, only when rdtsc measurement is used -- its likely there is a problem
>   somewhere with how we perform the cycles to time conversion

Could you explain a bit more what you're seeing? I just tested your patches
and didn't see that here.


> * Possibly related, the floating point handling for the cycles_to_sec
> variable
>   is problematic in terms of precision (see FIXME, taken over from Andres'
> POC)

And probably also performance...


> Open questions from me:
> 
> 1) Do we need to account for different TSC offsets on different CPUs in SMP
>    systems? (the Linux kernel certainly has logic to that extent, but [3]
>    suggests this is no longer a problem on Nehalem and newer chips, i.e.
> those
>    having an invariant TSC)

I don't think we should cater to systems where we need that.


> 2) Should we have a setting "--with-tsc" for configure? (instead of always
>    enabling it when on Linux/x86 with a TSC clocksource)

Probably not worth it.


> 3) Are there cases where we actually want to use rdtsc*p*? (i.e. wait for
>    current instructions to finish -- the prior discussion seemed to suggest
>    we don't want it for node instruction measurements, but possibly we do
> want
>    this in other cases?)

I was wondering about that too... Perhaps we should add an
INSTR_TIME_SET_CURRENT_BARRIER() or such?
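Just to sketch what I mean (hypothetical names, building on the use_rdtsc
flag and pg_clock_gettime_ref_cycles() from your patch; __rdtscp() is the
partially serializing read from <x86intrin.h>):

#if defined(__x86_64__) && defined(__linux__)
#define INSTR_TIME_SET_CURRENT_BARRIER(t) \
    do { \
        unsigned int aux_; \
        (t) = use_rdtsc ? __rdtscp(&aux_) : pg_clock_gettime_ref_cycles(); \
    } while (0)
#else
#define INSTR_TIME_SET_CURRENT_BARRIER(t) \
    ((t) = pg_clock_gettime_ref_cycles())
#endif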


> 4) Should we support using the "mrs" instruction on ARM? (which is similar
> to
>    rdtsc, see [4])

I'd leave that for later personally.



>  #define NS_PER_S INT64CONST(1000000000)
>  #define US_PER_S INT64CONST(1000000)
>  #define MS_PER_S INT64CONST(1000)
> @@ -95,17 +104,37 @@ typedef int64 instr_time;
>  
>  #define INSTR_TIME_SET_ZERO(t)    ((t) = 0)
>  
> -static inline instr_time pg_clock_gettime_ns(void)
> +extern double cycles_to_sec;
> +
> +bool use_rdtsc;

This should be extern and inside the ifdef below.


> +#if defined(__x86_64__) && defined(__linux__)
> +extern void pg_clock_gettime_initialize_rdtsc(void);
> +#endif
> +
> +static inline instr_time pg_clock_gettime_ref_cycles(void)
>  {
>      struct timespec tmp;
>  
> +#if defined(__x86_64__) && defined(__linux__)
> +    if (use_rdtsc)
> +        return __rdtsc();
> +#endif
> +
>      clock_gettime(PG_INSTR_CLOCK, &tmp);
>  
>      return tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
>  }
>  

Greetings,

Andres Freund



Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
Maciek Sakrejda
Date:
I ran that original test case with and without the patch. Here are the
numbers I'm seeing:

master (best of three):

postgres=# SELECT count(*) FROM lotsarows;
Time: 582.423 ms

postgres=# EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM lotsarows;
Time: 616.102 ms

postgres=# EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 1068.700 ms (00:01.069)

patched (best of three):

postgres=# SELECT count(*) FROM lotsarows;
Time: 550.822 ms

postgres=# EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM lotsarows;
Time: 612.572 ms

postgres=# EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 690.875 ms

On Fri, Jul 1, 2022 at 10:26 AM Andres Freund <andres@anarazel.de> wrote:
> On 2022-07-01 01:23:01 -0700, Lukas Fittl wrote:
>...
> > Known WIP problems with this patch version:
> >
> > * There appears to be a timing discrepancy I haven't yet worked out, where
> >   the \timing data reported by psql doesn't match what EXPLAIN ANALYZE is
> >   reporting. With Andres' earlier test case, I'm seeing a consistent ~700ms
> >   higher for \timing than for the EXPLAIN ANALYZE time reported on the
> > server
> >   side, only when rdtsc measurement is used -- its likely there is a problem
> >   somewhere with how we perform the cycles to time conversion
>
> Could you explain a bit more what you're seeing? I just tested your patches
> and didn't see that here.

I did not see this either, but I did see that the execution time
reported by \timing is (for this test case) consistently 0.5-1ms
*lower* than the Execution Time reported by EXPLAIN. I did not see
that on master. Is that expected?

Thanks,
Maciek



Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
Ibrar Ahmed
Date:



The patch requires a rebase; please rebase the patch with the latest code.

Hunk #5 succeeded at 147 with fuzz 2 (offset -3 lines).
Hunk #6 FAILED at 170.
Hunk #7 succeeded at 165 (offset -69 lines).
2 out of 7 hunks FAILED -- saving rejects to file src/include/portability/instr_time.h.rej
patching file src/tools/msvc/Mkvcbuild.pm 


--
Ibrar Ahmed

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
Michael Paquier
Date:
On Tue, Sep 06, 2022 at 11:32:18AM +0500, Ibrar Ahmed wrote:
> Hunk #5 succeeded at 147 with fuzz 2 (offset -3 lines).
> Hunk #6 FAILED at 170.
> Hunk #7 succeeded at 165 (offset -69 lines).
> 2 out of 7 hunks FAILED -- saving rejects to file
> src/include/portability/instr_time.h.rej
> patching file src/tools/msvc/Mkvcbuild.pm

No rebased version has been sent since this update, so this patch has
been marked as RwF.
--
Michael

Attachment

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
David Geier
Date:
I think it would be great to get this patch committed. Beyond the 
reasons already mentioned, the significant overhead also tends to skew 
the reported runtimes in ways that make it difficult to compare them. 
For example, if two nodes are executed equally often but one needs twice 
the time to process its rows, EXPLAIN ANALYZE should report timings that 
are 2x apart. However, currently, the high overhead of clock_gettime() 
tends to skew the relative runtimes.

On 10/12/22 10:33, Michael Paquier wrote:
> No rebased version has been sent since this update, so this patch has
> been marked as RwF.

I've rebased the patch set on latest master and fixed a few compiler 
warnings. Beyond that some findings and thoughts:

You're only using RDTSC if the clock source is 'tsc'. Great idea to not 
bother caring about a lot of hairy TSC details. Looking at the kernel 
code this seems to imply that the TSC is frequency invariant. I don't 
think, though, that this implies that Linux is not running under a 
hypervisor; which is good because I assume PostgreSQL is used a lot in 
VMs. However, when running under a hypervisor (at least with VMware), 
CPUID leaf 0x16 is not available. In my tests __get_cpuid() indicated 
success but the returned values were garbage. Instead of using leaf 
0x16, we should then use the hypervisor interface to obtain the TSC 
frequency. Checking if a hypervisor is active can be done via:

bool IsHypervisorActive()
{
    uint32 cpuinfo[4] = {0};

    /* CPUID leaf 0x1: the hypervisor-present flag is ECX bit 31 */
    int res = __get_cpuid(0x1, &cpuinfo[0], &cpuinfo[1], &cpuinfo[2], &cpuinfo[3]);

    return res > 0 && (cpuinfo[2] & (1u << 31));
}

Obtaining the TSC frequency via the hypervisor interface can be done 
with the following code. See https://lwn.net/Articles/301888/ for more 
details.

// Under hypervisors (tested with VMware) leaf 0x16 is not available, even
// though __get_cpuid() succeeds. Hence, if running under a hypervisor, use
// the hypervisor interface to obtain the TSC frequency.
uint32 cpuinfo[4] = {0};

if (IsHypervisorActive() &&
    __get_cpuid(0x40000001, &cpuinfo[0], &cpuinfo[1], &cpuinfo[2], &cpuinfo[3]) > 0)
    cycles_to_sec = 1.0 / ((double) cpuinfo[0] * 1000 * 1000);

Given that we anyway switch between RDTSC and clock_gettime() with a 
global variable, what about exposing the clock source as a GUC? That way 
the user can switch back to a working clock source in case we miss a 
detail around activating or reading the TSC.

I'm happy to update the patches accordingly.

--
David Geier
(ServiceNow)




Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
David Geier
Date:
I missed attaching the patches.

-- 
David Geier
(ServiceNow)

Attachment

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
David Geier
Date:
Hi,

I re-based again on master and applied the following changes:

I removed the fallback for obtaining the TSC frequency from /proc/cpuinfo, 
as suggested by Andres. Worst case, we fall back to clock_gettime().

I added code to obtain the TSC frequency via CPUID when under a 
hypervisor. I had to use __cpuid() directly instead of __get_cpuid(), 
because __get_cpuid() returns an error for the hypervisor leaves 
(probably the implementation pre-dates the hypervisor timing leaves). 
Unfortunately, while testing my implementation under VMware, I found 
that RDTSC runs awfully slow there (like 30x slower). [1] indicates that 
we cannot generally rely on RDTSC being actually fast on VMs. However, 
the same applies to clock_gettime(). It runs as slow as RDTSC on my 
VMware setup. Hence, using RDTSC is not at a disadvantage. I'm not 
entirely sure if there aren't cases where e.g. clock_gettime() is 
actually faster than RDTSC and it would be advantageous to use 
clock_gettime(). We could add a GUC so that the user can decide which 
clock source to use. Any thoughts?

I also somewhat improved the accuracy of the cycles to milli- and 
microseconds conversion functions by having two more multipliers with 
higher precision. For microseconds we could also keep the computation 
integer-only. I'm wondering what to best do for seconds and 
milliseconds. I'm currently leaning towards just keeping it as is, 
because the durations measured and converted are usually long enough 
that precision shouldn't be a problem.

In vacuumlazy.c we do INSTR_TIME_GET_MICROSEC(elapsed) / 1000. I 
changed that to use INSTR_TIME_GET_MILLISEC() instead. Additionally, I 
initialized a few variables of type instr_time which otherwise resulted 
in warnings due to use of potentially uninitialized variables.

I also couldn't reproduce the reported timing discrepancy. For me the 
runtime reported by \timing is just slightly higher than the time 
reported by EXPLAIN ANALYZE, which is expected.

Beyond that:

What about renaming INSTR_TIME_GET_DOUBLE() to INSTR_TIME_GET_SECS() so 
that it's consistent with the _MILLISEC() and _MICROSEC() variants?

INSTR_TIME_GET_MICROSEC() returns a uint64 while the other variants 
return double. This seems error-prone. What about renaming the function, 
or having it also return a double and casting where necessary at the 
call site?

If no one objects I would also re-register this patch in the commit fest.

[1] 
https://vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/Timekeeping-In-VirtualMachines.pdf 
(page 11 "Virtual TSC")

-- 
David Geier
(ServiceNow)

Attachment

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
Lukas Fittl
Date:
Hi David,

Thanks for continuing to work on this patch, and my apologies for the silence on it.

It's been hard to make time, especially since I typically develop on an ARM-based macOS system where I can't test this directly - hence my tests with virtualized EC2 instances, where I ran into the timing oddities.

On Mon, Jan 2, 2023 at 5:28 AM David Geier <geidav.pg@gmail.com> wrote:
The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other variants
return double. This seems error prone. What about renaming the function
or also have the function return a double and cast where necessary at
the call site?

Minor note, but in my understanding using a uint64 (where we can) is faster for any simple arithmetic we do with the values.
 
If no one objects I would also re-register this patch in the commit fest.

+1, and feel free to carry this patch forward - I'll try to make an effort to review my earlier testing issues again, as well as your later improvements to the patch.

Also, FYI, I just posted an alternate idea for speeding up EXPLAIN ANALYZE with timing over in [0], using a sampling-based approach to reduce the timing overhead.


Thanks,
Lukas

--
Lukas Fittl

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
Maciek Sakrejda
Date:
On Fri, Jul 15, 2022 at 11:21 AM Maciek Sakrejda <m.sakrejda@gmail.com> wrote:
> On Fri, Jul 1, 2022 at 10:26 AM Andres Freund <andres@anarazel.de> wrote:
> > On 2022-07-01 01:23:01 -0700, Lukas Fittl wrote:
> >...
> > > Known WIP problems with this patch version:
> > >
> > > * There appears to be a timing discrepancy I haven't yet worked out, where
> > >   the \timing data reported by psql doesn't match what EXPLAIN ANALYZE is
> > >   reporting. With Andres' earlier test case, I'm seeing a consistent ~700ms
> > >   higher for \timing than for the EXPLAIN ANALYZE time reported on the
> > > server
> > >   side, only when rdtsc measurement is used -- its likely there is a problem
> > >   somewhere with how we perform the cycles to time conversion
> >
> > Could you explain a bit more what you're seeing? I just tested your patches
> > and didn't see that here.
>
> I did not see this either, but I did see that the execution time
> reported by \timing is (for this test case) consistently 0.5-1ms
> *lower* than the Execution Time reported by EXPLAIN. I did not see
> that on master. Is that expected?

For what it's worth, I can no longer reproduce this. In fact, I went
back to master-as-of-around-then and applied Lukas' v2 patches again,
and I still can't reproduce that. I do remember it happening
consistently across several executions, but now \timing consistently
shows 0.5-1ms slower, as expected. This does not explain the different
timing issue Lukas was seeing in his tests, but I think we can assume
what I reported originally here is not an issue.



Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
David Geier
Date:
Hi Lukas,

On 1/2/23 20:50, Lukas Fittl wrote:
> Thanks for continuing to work on this patch, and my apologies for 
> silence on the patch.

It would be great if you could review it.
Please also share your thoughts around exposing the used clock source as a 
GUC and renaming INSTR_TIME_GET_DOUBLE() to _SECS().

I rebased again on master because of [1]. Patches attached.

>
> Its been hard to make time, and especially so because I typically 
> develop on an ARM-based macOS system where I can't test this directly 
> - hence my tests with virtualized EC2 instances, where I ran into the 
> timing oddities.
That's good and bad: bad for development, but good for testing the 
implementation on more virtualized setups, given that I also encountered 
"interesting" behavior on VMware (see my previous mails).
>
> On Mon, Jan 2, 2023 at 5:28 AM David Geier <geidav.pg@gmail.com> wrote:
>
>     The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other
>     variants
>     return double. This seems error prone. What about renaming the
>     function
>     or also have the function return a double and cast where necessary at
>     the call site?
>
>
> Minor note, but in my understanding using a uint64 (where we can) is 
> faster for any simple arithmetic we do with the values.

That's true. So the argument could be that for seconds and milliseconds 
we want the extra precision while microseconds are precise enough. 
Still, we could also make the seconds and milliseconds conversion code 
integer-only and e.g. return two integers with the values before and 
after the decimal point. From what I can see, the functions are nowhere 
used in performance-critical code, so it doesn't really make a 
difference performance-wise.
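Something like the following is what I have in mind (just a sketch,
assuming the nanosecond representation from the 0001 patch; the macro
names are made up):

/* whole seconds and the sub-second remainder, integer-only */
#define INSTR_TIME_GET_SECS_WHOLE(t)    ((int64) (t) / NS_PER_S)
#define INSTR_TIME_GET_SECS_FRAC_NS(t)  ((int64) (t) % NS_PER_S)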

>
> +1, and feel free to carry this patch forward - I'll try to make an 
> effort to review my earlier testing issues again, as well as your 
> later improvements to the patch.
Moved to the current commit fest. Will you become a reviewer?
>
> Also, FYI, I just posted an alternate idea for speeding up EXPLAIN 
> ANALYZE with timing over in [0], using a sampling-based approach to 
> reduce the timing overhead.

Interesting idea. I'll reply with some thoughts on the corresponding thread.

[1] 
https://www.postgresql.org/message-id/flat/CALDaNm3kRBGPhndujr9JcjjbDCG3anhj0vW8b9YtbXrBDMSvvw%40mail.gmail.com

-- 
David Geier
(ServiceNow)

Attachment

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
vignesh C
Date:
On Tue, 3 Jan 2023 at 14:08, David Geier <geidav.pg@gmail.com> wrote:
>
> Hi Lukas,
>
> On 1/2/23 20:50, Lukas Fittl wrote:
> > Thanks for continuing to work on this patch, and my apologies for
> > silence on the patch.
>
> It would be great if you could review it.
> Please also share your thoughts around exposing the used clock source as
> GUC and renaming INSTR_TIME_GET_DOUBLE() to _SECS().
>
> I rebased again on master because of [1]. Patches attached.
>
> >
> > Its been hard to make time, and especially so because I typically
> > develop on an ARM-based macOS system where I can't test this directly
> > - hence my tests with virtualized EC2 instances, where I ran into the
> > timing oddities.
> That's good and bad. Bad to do the development and good to test the
> implementation on more virtualized setups; given that I also encountered
> "interesting" behavior on VMWare (see my previous mails).
> >
> > On Mon, Jan 2, 2023 at 5:28 AM David Geier <geidav.pg@gmail.com> wrote:
> >
> >     The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other
> >     variants
> >     return double. This seems error prone. What about renaming the
> >     function
> >     or also have the function return a double and cast where necessary at
> >     the call site?
> >
> >
> > Minor note, but in my understanding using a uint64 (where we can) is
> > faster for any simple arithmetic we do with the values.
>
> That's true. So the argument could be that for seconds and milliseconds
> we want the extra precision while microseconds are precise enough.
> Still, we could also make the seconds and milliseconds conversion code
> integer only and e.g. return two integers with the value before and
> after the comma. FWICS, the functions are nowhere used in performance
> critical code, so it doesn't really make a difference performance-wise.
>
> >
> > +1, and feel free to carry this patch forward - I'll try to make an
> > effort to review my earlier testing issues again, as well as your
> > later improvements to the patch.
> Moved to the current commit fest. Will you become reviewer?
> >
> > Also, FYI, I just posted an alternate idea for speeding up EXPLAIN
> > ANALYZE with timing over in [0], using a sampling-based approach to
> > reduce the timing overhead.
>
> Interesting idea. I'll reply with some thoughts on the corresponding thread.
>
> [1]
> https://www.postgresql.org/message-id/flat/CALDaNm3kRBGPhndujr9JcjjbDCG3anhj0vW8b9YtbXrBDMSvvw%40mail.gmail.com

CFBot shows some compilation errors as in [1], please post an updated
version for the same:
09:08:12.525] /usr/bin/ld:
src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: warning:
relocation against `cycles_to_sec' in read-only section `.text'
[09:08:12.525] /usr/bin/ld:
src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: in
function `pg_clock_gettime_ref_cycles':
[09:08:12.525] /tmp/cirrus-ci-build/build/../src/include/portability/instr_time.h:119:
undefined reference to `use_rdtsc'
[09:08:12.525] /usr/bin/ld:
src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: in
function `test_timing':
[09:08:12.525] /tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:135:
undefined reference to `pg_clock_gettime_initialize_rdtsc'
[09:08:12.525] /usr/bin/ld:
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:137:
undefined reference to `cycles_to_us'
[09:08:12.525] /usr/bin/ld:
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:146:
undefined reference to `cycles_to_us'
[09:08:12.525] /usr/bin/ld:
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:169:
undefined reference to `cycles_to_us'
[09:08:12.525] /usr/bin/ld:
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:176:
undefined reference to `cycles_to_sec'
[09:08:12.525] /usr/bin/ld: warning: creating DT_TEXTREL in a PIE
[09:08:12.525] collect2: error: ld returned 1 exit status

[1] - https://cirrus-ci.com/task/5375312565895168

Regards,
Vignesh



Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
David Geier
Date:
Hi,

> CFBot shows some compilation errors as in [1], please post an updated
> version for the same:
> 09:08:12.525] /usr/bin/ld:
> src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: warning:
> relocation against `cycles_to_sec' in read-only section `.text'
> [09:08:12.525] /usr/bin/ld:
> src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: in
> function `pg_clock_gettime_ref_cycles':
> [09:08:12.525] /tmp/cirrus-ci-build/build/../src/include/portability/instr_time.h:119:
> undefined reference to `use_rdtsc'
> [09:08:12.525] /usr/bin/ld:
> src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: in
> function `test_timing':
> [09:08:12.525] /tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:135:
> undefined reference to `pg_clock_gettime_initialize_rdtsc'
> [09:08:12.525] /usr/bin/ld:
> /tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:137:
> undefined reference to `cycles_to_us'
> [09:08:12.525] /usr/bin/ld:
> /tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:146:
> undefined reference to `cycles_to_us'
> [09:08:12.525] /usr/bin/ld:
> /tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:169:
> undefined reference to `cycles_to_us'
> [09:08:12.525] /usr/bin/ld:
> /tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:176:
> undefined reference to `cycles_to_sec'
> [09:08:12.525] /usr/bin/ld: warning: creating DT_TEXTREL in a PIE
> [09:08:12.525] collect2: error: ld returned 1 exit status
>
> [1] - https://cirrus-ci.com/task/5375312565895168
>
> Regards,
> Vignesh

I fixed the compilation error on CFBot.
I missed adding instr_time.c to the Meson makefile.
New patch set attached.

-- 
David Geier
(ServiceNow)

Attachment

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
Andres Freund
Date:
Hi,

On 2023-01-04 13:02:05 +0100, David Geier wrote:
> From be18633d4735f680c7910fcb4e8ac90c4eada131 Mon Sep 17 00:00:00 2001
> From: David Geier <geidav.pg@gmail.com>
> Date: Thu, 17 Nov 2022 10:22:01 +0100
> Subject: [PATCH 1/3] Change instr_time to just store nanoseconds, that's
>  cheaper.

Does anybody see a reason to not move forward with this aspect? We do a fair
amount of INSTR_TIME_ACCUM_DIFF() etc, and that gets a good bit cheaper by
just using nanoseconds. We'd also save memory in BufferUsage (144 -> 122
bytes) and Instrumentation (16 bytes saved in Instrumentation itself, 32 via
BufferUsage).

While the range of instr_time storing nanoseconds wouldn't be good enough for
a generic timestamp facility (hence using microsecs for Timestamp), the range
seems plenty for its use of measuring runtime:

((2 ** 63) - 1) / ((10 ** 9) * 60 * 60 * 24 * 365) = ~292 years

Of course, when using CLOCK_REALTIME, this is relative to 1970-01-01, so just
239 years.

It could theoretically be a different story, if we stored instr_time's on
disk. But we don't, they're ephemeral.


This doesn't buy a whole lot of performance - the bottleneck is the actual
timestamp computation. But in a query with not much else going on, it's
visible and reproducible. It's, unsurprisingly, a lot easier to see when using
BUFFERS.

For both timespec and nanoseconds, I measured three server starts, and for each
started server three executions of

pgbench -n -Mprepared -c1 -P5 -T15 -f <(echo "EXPLAIN (ANALYZE, BUFFERS) SELECT generate_series(1, 10000000) OFFSET 10000000;")

The best results are:
timespec: 1073.431
nanosec:  957.532
a ~10% difference

Greetings,

Andres Freund



Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2023-01-04 13:02:05 +0100, David Geier wrote:
>> Subject: [PATCH 1/3] Change instr_time to just store nanoseconds, that's
>> cheaper.

> Does anybody see a reason to not move forward with this aspect? We do a fair
> amount of INSTR_TIME_ACCUM_DIFF() etc, and that gets a good bit cheaper by
> just using nanoseconds.

Cheaper, and perhaps more accurate too?  Don't recall if we have any code
paths where the input timestamps are likely to be better-than-microsecond,
but surely that's coming someday.

I'm unsure that we want to deal with rdtsc's vagaries in general, but
no objection to changing instr_time.

            regards, tom lane



Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
Andres Freund
Date:
Hi,

On 2023-01-13 15:25:16 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Does anybody see a reason to not move forward with this aspect? We do a fair
> > amount of INSTR_TIME_ACCUM_DIFF() etc, and that gets a good bit cheaper by
> > just using nanoseconds.
>
> Cheaper, and perhaps more accurate too?  Don't recall if we have any code
> paths where the input timestamps are likely to be better-than-microsecond,
> but surely that's coming someday.

instr_time on !WIN32 use struct timespec, so we already should have nanosecond
precision available. IOW, we could add a INSTR_TIME_GET_NANOSEC today. Or am I
misunderstanding what you mean?


> I'm unsure that we want to deal with rdtsc's vagaries in general, but
> no objection to changing instr_time.

Cool.

Looking at the instr_time.h part of the change, I think it should go further,
and basically do the same thing in the WIN32 path. The only part that needs to
be win32 specific is INSTR_TIME_SET_CURRENT(). That'd reduce duplication a
good bit.

Greetings,

Andres Freund



Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
vignesh C
Date:
On Wed, 4 Jan 2023 at 17:32, David Geier <geidav.pg@gmail.com> wrote:
>
> I fixed the compilation error on CFBot.
> I missed adding instr_time.c to the Meson makefile.
> New patch set attached.

The patch does not apply on top of HEAD as in [1], please post a rebased patch:
=== Applying patches on top of PostgreSQL commit ID
ff23b592ad6621563d3128b26860bcb41daf9542 ===
=== applying patch
./0002-Use-CPU-reference-cycles-via-RDTSC-to-measure-time-v6.patch
....
patching file src/tools/msvc/Mkvcbuild.pm
Hunk #1 FAILED at 135.
1 out of 1 hunk FAILED -- saving rejects to file src/tools/msvc/Mkvcbuild.pm.rej

[1] - http://cfbot.cputube.org/patch_41_3751.log

Regards,
Vignesh



Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
Andres Freund
Date:
Hi,

On 2023-01-13 11:55:47 -0800, Andres Freund wrote:
> Does anybody see a reason to not move forward with this aspect? We do a fair
> amount of INSTR_TIME_ACCUM_DIFF() etc, and that gets a good bit cheaper by
> just using nanoseconds. We'd also save memory in BufferUsage (144-122 bytes),
> Instrumentation (16 bytes saved in Instrumentation itself, 32 via
> BufferUsage).

This actually under-counted the benefits, because we have two BufferUsage and
two WalUsage in Instrumentation.

Before:
        /* size: 448, cachelines: 7, members: 20 */
        /* sum members: 445, holes: 1, sum holes: 3 */
After
        /* size: 368, cachelines: 6, members: 20 */
        /* sum members: 365, holes: 1, sum holes: 3 */


The difference in the number of instructions in InstrStopNode is astounding:
1016 instructions with timespec, 96 instructions with nanoseconds. Some of
that is the simpler data structure, some because the compiler now can
auto-vectorize the four INSTR_TIME_ACCUM_DIFF in BufferUsageAccumDiff into
one.

We probably should convert Instrumentation->firsttuple to an instr_time now as
well; there's no point in having the code for conversion to double in the hot
routine, that can easily happen in explain. But that's for a later patch.


I suggested downthread that we should convert the win32 implementation to be
more similar to the unix-nanoseconds representation. A blind conversion looks
good, and lets us share a number of macros.


I wonder if we should deprecate INSTR_TIME_IS_ZERO()/INSTR_TIME_SET_ZERO() and
allow 0 to be used instead. Not needing INSTR_TIME_SET_ZERO() allows variable
definitions to initialize the value, which does avoid some unnecessarily
awkward code.  Alternatively we could introduce INSTR_TIME_ZERO() for that
purpose?
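E.g. (hypothetical spelling, assuming the plain 64-bit integer
representation from 0002):

#define INSTR_TIME_ZERO()   ((instr_time) 0)

/* allows initialization at definition time: */
instr_time  total = INSTR_TIME_ZERO();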


Greetings,

Andres Freund



Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
Andres Freund
Date:
Hi,

On 2023-01-02 14:28:20 +0100, David Geier wrote:
> I also somewhat improved the accuracy of the cycles to milli- and
> microseconds conversion functions by having two more multipliers with higher
> precision. For microseconds we could also keep the computation integer-only.
> I'm wondering what to best do for seconds and milliseconds. I'm currently
> leaning towards just keeping it as is, because the durations measured and
> converted are usually long enough that precision shouldn't be a problem.

I'm doubtful this is worth the complexity it incurs. By the time we convert
out of the instr_time format, the times shouldn't be small enough that the
accuracy is affected much.

Looking around, most of the existing uses of INSTR_TIME_GET_MICROSEC()
actually accumulate themselves, and should instead keep things in the
instr_time format and convert later. We'd win more accuracy / speed that way.

I don't think the introduction of pg_time_usec_t was a great idea, but oh
well.


> Additionally, I initialized a few variables of type instr_time which
> otherwise resulted in warnings due to use of potentially uninitialized
> variables.

Unless we decide, as I suggested downthread, that we deprecate
INSTR_TIME_SET_ZERO(), that's unfortunately not the right fix. I've a similar
patch that adds all the necessary INSTR_TIME_SET_ZERO() calls.


> What about renaming INSTR_TIME_GET_DOUBLE() to INSTR_TIME_GET_SECS() so that
> it's consistent with the _MILLISEC() and _MICROSEC() variants?

> The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other variants
> return double. This seems error prone. What about renaming the function or
> also have the function return a double and cast where necessary at the call
> site?

I think those should be a separate discussion / patch.

Greetings,

Andres Freund



Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
Tomas Vondra
Date:
Hi,

there's minor bitrot in the Mkvcbuild.pm change, making cfbot unhappy.

As for the patch, I don't have much comments. I'm wondering if it'd be
useful to indicate which timing source was actually used for EXPLAIN
ANALYZE, say something like:

 Planning time: 0.197 ms
 Execution time: 0.225 ms
 Timing source: clock_gettime (or tsc)

There has been a proposal to expose this as a GUC (or perhaps as explain
option), to allow users to pick what timing source to use. I wouldn't go
that far - AFAICS is this is meant to be universally better when
available. But knowing which source was used seems useful.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
Pavel Stehule
Date:


On Mon, Jan 16, 2023 at 9:34 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
Hi,

there's minor bitrot in the Mkvcbuild.pm change, making cfbot unhappy.

As for the patch, I don't have much comments. I'm wondering if it'd be
useful to indicate which timing source was actually used for EXPLAIN
ANALYZE, say something like:

 Planning time: 0.197 ms
 Execution time: 0.225 ms
 Timing source: clock_gettime (or tsc)

There has been a proposal to expose this as a GUC (or perhaps as explain
option), to allow users to pick what timing source to use. I wouldn't go
that far - AFAICS is this is meant to be universally better when
available. But knowing which source was used seems useful.

+1

Pavel





Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
Robert Haas
Date:
On Fri, Jan 13, 2023 at 2:56 PM Andres Freund <andres@anarazel.de> wrote:
> Does anybody see a reason to not move forward with this aspect? We do a fair
> amount of INSTR_TIME_ACCUM_DIFF() etc, and that gets a good bit cheaper by
> just using nanoseconds. We'd also save memory in BufferUsage (144-122 bytes),
> Instrumentation (16 bytes saved in Instrumentation itself, 32 via
> BufferUsage).

I read through 0001 and it seems basically fine to me. Comments:

1. pg_clock_gettime_ns() doesn't follow pgindent conventions.

2. I'm not entirely sure that the new .?S_PER_.?S macros are
worthwhile but maybe they are, and in any case I don't care very much.

3. I've always found 'struct timespec' to be pretty annoying
notationally, so I like the fact that this patch would reduce use of
it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
Andres Freund
Date:
Hi,

On 2023-01-17 08:46:12 -0500, Robert Haas wrote:
> On Fri, Jan 13, 2023 at 2:56 PM Andres Freund <andres@anarazel.de> wrote:
> > Does anybody see a reason to not move forward with this aspect? We do a fair
> > amount of INSTR_TIME_ACCUM_DIFF() etc, and that gets a good bit cheaper by
> > just using nanoseconds. We'd also save memory in BufferUsage (144-122 bytes),
> > Instrumentation (16 bytes saved in Instrumentation itself, 32 via
> > BufferUsage).

Here's an updated version of the move to representing instr_time as
nanoseconds. It's now split into a few patches:

0001) Add INSTR_TIME_SET_ZERO() calls where otherwise 0002 causes gcc to
      warn

      Alternatively we can decide to deprecate INSTR_TIME_SET_ZERO() and
      just allow to assign 0.

0002) Convert instr_time to uint64

      This is the cleaned up version of the prior patch. The main change is
      that it deduplicated a lot of the code between the architectures.

0003) Add INSTR_TIME_SET_SECOND()

      This is used in 0004. Just allows setting an instr_time to a time in
      seconds, allowing for a cheaper loop exit condition in 0004.

0004) report nanoseconds in pg_test_timing


I also couldn't help and hacked a bit on the rdtsc pieces. I did figure out
how to do the cycles->nanosecond conversion with integer shift and multiply in
the common case, which does show a noticeable speedup. But that's for another
day.
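For the record, the shift-and-multiply scheme is the usual fixed-point
trick (the same one the kernel clocksource code uses); a rough sketch,
not the actual patch (unsigned __int128 is a GCC/Clang extension):

#include <stdint.h>

static uint64_t tsc_mult;
static const int tsc_shift = 32;

/* precompute once: ns = cycles * (1e9 / freq_hz) = (cycles * mult) >> shift */
static void
init_cycles_to_ns(double tsc_freq_hz)
{
    tsc_mult = (uint64_t) ((1000000000.0 / tsc_freq_hz) * ((uint64_t) 1 << tsc_shift));
}

static inline uint64_t
cycles_to_ns(uint64_t cycles)
{
    /* 128-bit intermediate avoids overflow for large cycle counts */
    return (uint64_t) (((unsigned __int128) cycles * tsc_mult) >> tsc_shift);
}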

I fought a bit with myself about whether to send those patches in this thread,
because it'll take over the CF entry. But decided that it's ok, given that
David's patches should be rebased over these anyway?


> I read through 0001 and it seems basically fine to me. Comments:
>
> 1. pg_clock_gettime_ns() doesn't follow pgindent conventions.

Fixed.


> 2. I'm not entirely sure that the new .?S_PER_.?S macros are
> worthwhile but maybe they are, and in any case I don't care very much.

There's now fewer. But those I'd like to keep. I just end up counting digits
manually way too many times.


> 3. I've always found 'struct timespec' to be pretty annoying
> notationally, so I like the fact that this patch would reduce use of
> it.

Same.

Greetings,

Andres Freund

Attachment

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> Here's an updated version of the move to representing instr_time as
> nanoseconds. It's now split into a few patches:

I took a quick look through this.

> 0001) Add INSTR_TIME_SET_ZERO() calls where otherwise 0002 causes gcc to
>       warn
>       Alternatively we can decide to deprecate INSTR_TIME_SET_ZERO() and
>       just allow to assign 0.

I think it's probably wise to keep the macro.  If we ever rethink this
again, we'll be glad we kept it.  Similarly, IS_ZERO is a good idea
even if it would work with just compare-to-zero.  I'm almost tempted
to suggest you define instr_time as a struct with a uint64 field,
just to help keep us honest about that.

> 0003) Add INSTR_TIME_SET_SECOND()
>       This is used in 0004. Just allows setting an instr_time to a time in
>       seconds, allowing for a cheaper loop exit condition in 0004.

Code and comments are inconsistent about whether it's SET_SECOND or
SET_SECONDS.  I think I prefer the latter, but don't care that much.

> 0004) report nanoseconds in pg_test_timing

Didn't examine 0004 in any detail, but the others look good to go
other than these nits.

            regards, tom lane



Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
Andres Freund
Date:
Hi,

On 2023-01-17 12:26:57 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Here's an updated version of the move to representing instr_time as
> > nanoseconds. It's now split into a few patches:
> 
> I took a quick look through this.

Thanks!


> > 0001) Add INSTR_TIME_SET_ZERO() calls where otherwise 0002 causes gcc to
> >       warn
> >       Alternatively we can decide to deprecate INSTR_TIME_SET_ZERO() and
> >       just allow to assign 0.
> 
> I think it's probably wise to keep the macro.  If we ever rethink this
> again, we'll be glad we kept it.  Similarly, IS_ZERO is a good idea
> even if it would work with just compare-to-zero.

Perhaps an INSTR_TIME_ZERO() that could be assigned in variable definitions
could give us the best of both worlds?
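
Minimal sketch of what I mean (hypothetical, not taken from the attached
patches verbatim):

#define INSTR_TIME_ZERO() ((instr_time) 0)

    instr_time    total = INSTR_TIME_ZERO();

(With a struct-wrapped instr_time it'd be a compound literal like
((instr_time) {0}) instead.)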


> I'm almost tempted to suggest you define instr_time as a struct with a
> uint64 field, just to help keep us honest about that.

I can see that making sense. Unless somebody pipes up with opposition to that
plan soon, I'll see how it goes.


> > 0003) Add INSTR_TIME_SET_SECOND()
> >       This is used in 0004. Just allows setting an instr_time to a time in
> >       seconds, allowing for a cheaper loop exit condition in 0004.
> 
> Code and comments are inconsistent about whether it's SET_SECOND or
> SET_SECONDS.  I think I prefer the latter, but don't care that much.

That's probably because I couldn't decide... So I'll go with your preference.


> > 0004) report nanoseconds in pg_test_timing
> 
> Didn't examine 0004 in any detail, but the others look good to go
> other than these nits.

Thanks for looking!

Greetings,

Andres Freund



Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
David Geier
Date:
On 1/16/23 21:39, Pavel Stehule wrote:
>
> On Mon, Jan 16, 2023 at 21:34 Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>
>     Hi,
>
>     there's minor bitrot in the Mkvcbuild.pm change, making cfbot unhappy.
>
>     As for the patch, I don't have much comments. I'm wondering if it'd be
>     useful to indicate which timing source was actually used for EXPLAIN
>     ANALYZE, say something like:
>
>      Planning time: 0.197 ms
>      Execution time: 0.225 ms
>      Timing source: clock_gettime (or tsc)
>
>     There has been a proposal to expose this as a GUC (or perhaps as
>     explain
>     option), to allow users to pick what timing source to use. I
>     wouldn't go
>     that far - AFAICS this is meant to be universally better when
>     available. But knowing which source was used seems useful.
>
>
> +1

Thanks for looking at the patch.

I'll fix the merge conflict.

I like the idea of exposing the timing source in the EXPLAIN ANALYZE output.
It's a good tradeoff between inspectability and effort, given that RDTSC 
should always be better to use.
If there are no objections, I'll go this way.

-- 
David Geier
(ServiceNow)




Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
David Geier
Date:
On 1/16/23 18:37, Andres Freund wrote:
> Hi,
>
> On 2023-01-02 14:28:20 +0100, David Geier wrote:
>
> I'm doubtful this is worth the complexity it incurs. By the time we convert
> out of the instr_time format, the times shouldn't be small enough that the
> accuracy is affected much.
I don't feel strongly about it, and you have a point that we most likely
only convert once we've accumulated a fair amount of cycles.
> Looking around, most of the existing uses of INSTR_TIME_GET_MICROSEC()
> actually accumulate themselves, and should instead keep things in the
> instr_time format and convert later. We'd win more accuracy / speed that way.
>
> I don't think the introduction of pg_time_usec_t was a great idea, but oh
> well.
Fully agreed. Why not replace pg_time_usec_t with instr_time in a
separate patch? I haven't looked carefully enough to see whether all
occurrences could sanely be replaced, but at least the ones that
accumulate time seem like good starting points.
>> Additionally, I initialized a few variables of type instr_time which
>> otherwise resulted in warnings due to use of potentially uninitialized
>> variables.
> Unless we decide, as I suggested downthread, that we deprecate
> INSTR_TIME_SET_ZERO(), that's unfortunately not the right fix. I've a similar
> patch that adds all the necesarry INSTR_TIME_SET_ZERO() calls.
I don't feel strongly about it, but like Tom I tend towards keeping the
initialization macro.
Thanks for improving on the first patch and fixing these issues in a
better way.
>> What about renaming INSTR_TIME_GET_DOUBLE() to INSTR_TIME_GET_SECS() so that
>> it's consistent with the _MILLISEC() and _MICROSEC() variants?
>> The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other variants
>> return double. This seems error prone. What about renaming the function or
>> also have the function return a double and cast where necessary at the call
>> site?
> I think those should be a separate discussion / patch.

OK. I'll propose follow-on patches once we're done with the ones at hand.

I'll then rebase the RDTSC patches on your patch set.

-- 
David Geier
(ServiceNow)




Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
David Geier
Date:
Hi,

@Andres: will you take care of these changes and provide me with an 
updated patch set so I can rebase the RDTSC changes?
Otherwise, I can also apply Tom's suggestions to your patch set and send
out the complete patch set.

-- 
David Geier
(ServiceNow)




Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
David Geier
Date:
Hi Andres,

> I also couldn't help and hacked a bit on the rdtsc pieces. I did figure out
> how to do the cycles->nanosecond conversion with integer shift and multiply in
> the common case, which does show a noticable speedup. But that's for another
> day.
I also have code for that here. I decided against integrating it because 
we don't convert frequently enough to make it matter. Or am I missing 
something?
> I fought a bit with myself about whether to send those patches in this thread,
> because it'll take over the CF entry. But decided that it's ok, given that
> David's patches should be rebased over these anyway?
That's alright.
Though I would hope we attempt to bring in your patch set as well as the
RDTSC patch set.

-- 
David Geier
(ServiceNow)




Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
David Geier
Date:
On 1/18/23 13:52, David Geier wrote:
> On 1/16/23 21:39, Pavel Stehule wrote:
>>
>> On Mon, Jan 16, 2023 at 21:34 Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>
>>     Hi,
>>
>>     there's minor bitrot in the Mkvcbuild.pm change, making cfbot 
>> unhappy.
>>
>>     As for the patch, I don't have much comments. I'm wondering if 
>> it'd be
>>     useful to indicate which timing source was actually used for EXPLAIN
>>     ANALYZE, say something like:
>>
>>      Planning time: 0.197 ms
>>      Execution time: 0.225 ms
>>      Timing source: clock_gettime (or tsc)
>>
>> +1
>
> I like the idea of exposing the timing source in the EXPLAIN ANALYZE 
> output.
> It's a good tradeoff between inspectability and effort, given that 
> RDTSC should always be better to use.
> If there are no objections I go this way.
Thinking about this a little more made me realize that this will cause 
different pg_regress output depending on the platform. So if we go this 
route we would at least need an option for EXPLAIN ANALYZE to disable 
it. Or rather have it disabled by default and allow for enabling it. 
Thoughts?

-- 
David Geier
(ServiceNow)




Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
Tomas Vondra
Date:

On 1/20/23 07:43, David Geier wrote:
> On 1/18/23 13:52, David Geier wrote:
>> On 1/16/23 21:39, Pavel Stehule wrote:
>>>
>>> On Mon, Jan 16, 2023 at 21:34 Tomas Vondra
>>> <tomas.vondra@enterprisedb.com> wrote:
>>>
>>>     Hi,
>>>
>>>     there's minor bitrot in the Mkvcbuild.pm change, making cfbot
>>> unhappy.
>>>
>>>     As for the patch, I don't have much comments. I'm wondering if
>>> it'd be
>>>     useful to indicate which timing source was actually used for EXPLAIN
>>>     ANALYZE, say something like:
>>>
>>>      Planning time: 0.197 ms
>>>      Execution time: 0.225 ms
>>>      Timing source: clock_gettime (or tsc)
>>>
>>> +1
>>
>> I like the idea of exposing the timing source in the EXPLAIN ANALYZE
>> output.
>> It's a good tradeoff between inspectability and effort, given that
>> RDTSC should always be better to use.
>> If there are no objections I go this way.
> Thinking about this a little more made me realize that this will cause
> different pg_regress output depending on the platform. So if we go this
> route we would at least need an option for EXPLAIN ANALYZE to disable
> it. Or rather have it disabled by default and allow for enabling it.
> Thoughts?
> 

What about only showing it for VERBOSE mode? I don't think there are
very many tests doing EXPLAIN (ANALYZE, VERBOSE) - a quick grep found
one such place in partition_prune.sql.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
Andres Freund
Date:
Hi,

On 2023-01-17 10:50:53 -0800, Andres Freund wrote:
> On 2023-01-17 12:26:57 -0500, Tom Lane wrote:
> > > 0001) Add INSTR_TIME_SET_ZERO() calls where otherwise 0002 causes gcc to
> > >       warn
> > >       Alternatively we can decide to deprecate INSTR_TIME_SET_ZERO() and
> > >       just allow to assign 0.
> >
> > I think it's probably wise to keep the macro.  If we ever rethink this
> > again, we'll be glad we kept it.  Similarly, IS_ZERO is a good idea
> > even if it would work with just compare-to-zero.
>
> Perhaps an INSTR_TIME_ZERO() that could be assigned in variable definitions
> could give us the best of both worlds?

I tried that in the attached 0005. I found that it reads better if I also add
INSTR_TIME_CURRENT(). If we decide to go for this, I'd roll it into 0001
instead, but I wanted to get agreement on it first.

Comments?


> > I'm almost tempted to suggest you define instr_time as a struct with a
> > uint64 field, just to help keep us honest about that.
>
> I can see that making sense. Unless somebody pipes up with opposition to that
> plan soon, I'll see how it goes.

Done in the attached. I think it looks good. Actually found a type confusion
buglet in 0004, so the type safety benefit is noticeable.

It does require a new INSTR_TIME_IS_LT() for the loop exit condition in 0004,
but that seems fine.


Besides cosmetic stuff I also added back the cast to double in Windows'
INSTR_TIME_GET_NANOSEC() - I think there's an overflow danger without it.

We should make this faster by pre-computing
  (double) NS_PER_S / GetTimerFrequency()
once, as that'd avoid doing the slow division on every conversion. But
that's an old issue and thus better tackled separately.

Greetings,

Andres Freund

Attachment

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
>> Perhaps an INSTR_TIME_ZERO() that could be assigned in variable definitions
>> could give us the best of both worlds?

> I tried that in the attached 0005. I found that it reads better if I also add
> INSTR_TIME_CURRENT(). If we decide to go for this, I'd roll it into 0001
> instead, but I wanted to get agreement on it first.

-1 from here.  This forecloses the possibility that it's best to use more
than one assignment to initialize the value, and the code doesn't read
any better than it did before.

            regards, tom lane



Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
Andres Freund
Date:
Hi,

On 2023-01-20 22:27:07 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> >> Perhaps an INSTR_TIME_ZERO() that could be assigned in variable definitions
> >> could give us the best of both worlds?
> 
> > I tried that in the attached 0005. I found that it reads better if I also add
> > INSTR_TIME_CURRENT(). If we decide to go for this, I'd roll it into 0001
> > instead, but I wanted to get agreement on it first.
> 
> -1 from here.  This forecloses the possibility that it's best to use more
> than one assignment to initialize the value, and the code doesn't read
> any better than it did before.

I think it does read a bit better, but it's a pretty small improvement. So
I'll leave this aspect be for now.

Thanks for checking.

Greetings,

Andres Freund



Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
Andres Freund
Date:
Hi,

On 2023-01-19 11:47:49 +0100, David Geier wrote:
> > I also couldn't help and hacked a bit on the rdtsc pieces. I did figure out
> > how to do the cycles->nanosecond conversion with integer shift and multiply in
> > the common case, which does show a noticable speedup. But that's for another
> > day.
> I also have code for that here. I decided against integrating it because we
> don't convert frequently enough to make it matter. Or am I missing
> something?

We do currently do the conversion quite frequently.  Admittedly I was
partially motivated by trying to get the per-loop overhead in pg_test_timing
down ;)

But I think it's a real issue. Places where we do, but shouldn't, convert:

- ExecReScan() - quite painful, we can end up with a lot of those
- InstrStopNode() - adds a good bit of overhead to simple
- PendingWalStats.wal_write_time - this is particularly bad because it happens
  within very contended code
- calls to pgstat_count_buffer_read_time(), pgstat_count_buffer_write_time() -
  they can be very frequent
- pgbench.c, as we already discussed
- pg_stat_statements.c
- ...

These all will get a bit slower when moving to a "variable" frequency.


What was your approach for avoiding the costly operation?  I ended up with an
integer multiplication + shift approximation for the floating point
multiplication (which in turn uses the inverse of the division by the
frequency). To allow for sufficient precision while also avoiding overflows, I
had to make that branch conditional, with a slow path for large numbers of
nanoseconds.


> > I fought a bit with myself about whether to send those patches in this thread,
> > because it'll take over the CF entry. But decided that it's ok, given that
> > David's patches should be rebased over these anyway?
> That's alright.
> Though, I would hope we attempt to bring your patch set as well as the RDTSC
> patch set in.

I think it'd be great - but I'm not sure we're there yet, reliability and
code-complexity wise.

I think it might be worth making the rdtsc aspect somewhat
measurable. E.g. allowing pg_test_timing to use both at the same time, and
have it compare elapsed time with both sources of counters.

Greetings,

Andres Freund



Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
Andres Freund
Date:
Hi,

On 2023-01-20 07:43:00 +0100, David Geier wrote:
> On 1/18/23 13:52, David Geier wrote:
> > On 1/16/23 21:39, Pavel Stehule wrote:
> > > 
> > > On Mon, Jan 16, 2023 at 21:34 Tomas Vondra
> > > <tomas.vondra@enterprisedb.com> wrote:
> > > 
> > >     Hi,
> > > 
> > >     there's minor bitrot in the Mkvcbuild.pm change, making cfbot
> > > unhappy.
> > > 
> > >     As for the patch, I don't have much comments. I'm wondering if
> > > it'd be
> > >     useful to indicate which timing source was actually used for EXPLAIN
> > >     ANALYZE, say something like:
> > > 
> > >      Planning time: 0.197 ms
> > >      Execution time: 0.225 ms
> > >      Timing source: clock_gettime (or tsc)
> > > 
> > > +1
> > 
> > I like the idea of exposing the timing source in the EXPLAIN ANALYZE
> > output.
> > It's a good tradeoff between inspectability and effort, given that RDTSC
> > should always be better to use.
> > If there are no objections I go this way.
> Thinking about this a little more made me realize that this will cause
> different pg_regress output depending on the platform. So if we go this
> route we would at least need an option for EXPLAIN ANALYZE to disable it. Or
> rather have it disabled by default and allow for enabling it. Thoughts?

The elapsed time is already inherently unstable, so we shouldn't have any test
output showing the time.

But I doubt showing it in every explain is a good idea - we use instr_time in
plenty of other places. Why show it in explain, but not in all those other
places?

Greetings,

Andres Freund



Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
Andres Freund
Date:
Hi,

On 2023-01-18 14:05:35 +0100, David Geier wrote:
> @Andres: will you take care of these changes and provide me with an updated
> patch set so I can rebase the RDTSC changes?
> Otherwise, I can also apply Tom suggestions to your patch set and send out
> the complete patch set.

I'm planning to push most of my changes soon, had hoped to get to it a bit
sooner, but ...

If you have time to look at the pg_test_timing part, it'd be
appreciated. That's a bit larger, and nobody has looked at it yet. So I'm a bit
hesitant to push it.

Greetings,

Andres Freund



Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
Andres Freund
Date:
Hi,

On 2023-01-18 14:02:48 +0100, David Geier wrote:
> On 1/16/23 18:37, Andres Freund wrote:
> > I'm doubtful this is worth the complexity it incurs. By the time we convert
> > out of the instr_time format, the times shouldn't be small enough that the
> > accuracy is affected much.
>
> I don't feel strong about it and you have a point that we most likely only
> convert ones we've accumulated a fair amount of cycles.

I think we can avoid the issue another way. The inaccuracy comes from the
cycles_to_sec ending up very small, right? Right now your patch has (and
probably my old version similarly had):

cycles_to_sec = 1.0 / (tsc_freq * 1000);

I think it's better if we have one multiplier to convert cycles to nanoseconds
- that'll be a double comparatively close to 1. We can use that to implement
INSTR_TIME_GET_NANOSECONDS(). The conversion to microseconds then is just a
division by 1000 (which most compilers convert into a multiplication/shift
combo), and the conversions to milliseconds and seconds will be similar.

Because we'll never "wrongly" go into the "huge number" or "very small number"
ranges, that should provide sufficient precision? We'll of course still end up
with a very small number when converting a few nanoseconds to seconds, but
that's ok because it's the precision being asked for, instead of loosing
precision in some intermediate representation.


> > Looking around, most of the existing uses of INSTR_TIME_GET_MICROSEC()
> > actually accumulate themselves, and should instead keep things in the
> > instr_time format and convert later. We'd win more accuracy / speed that way.
> > 
> > I don't think the introduction of pg_time_usec_t was a great idea, but oh
> > well.
> Fully agreed. Why not replacing pg_time_usec_t with instr_time in a separate
> patch?

pgbench used to use instr_time, but it was replaced by somebody thinking the
API is too cumbersome. Which I can't quite deny, even though I think the
specific change isn't great.

But yes, this should definitely be a separate patch.


Greetings,

Andres Freund



Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
Justin Pryzby
Date:
On Fri, Jan 20, 2023 at 04:40:32PM -0800, Andres Freund wrote:
> From 5a458d4584961dedd3f80a07d8faea66e57c5d94 Mon Sep 17 00:00:00 2001
> From: Andres Freund <andres@anarazel.de>
> Date: Mon, 16 Jan 2023 11:19:11 -0800
> Subject: [PATCH v8 4/5] wip: report nanoseconds in pg_test_timing

>    <para>
> -   The i7-860 system measured runs the count query in 9.8 ms while
> -   the <command>EXPLAIN ANALYZE</command> version takes 16.6 ms, each
> -   processing just over 100,000 rows.  That 6.8 ms difference means the timing
> -   overhead per row is 68 ns, about twice what pg_test_timing estimated it
> -   would be.  Even that relatively small amount of overhead is making the fully
> -   timed count statement take almost 70% longer.  On more substantial queries,
> -   the timing overhead would be less problematic.
> +   The i9-9880H system measured shows an execution time of 4.116 ms for the
> +   <literal>TIMING OFF</literal> query, and 6.965 ms for the
> +   <literal>TIMING ON</literal>, each processing 100,000 rows.
> +
> +   That 2.849 ms difference means the timing overhead per row is 28 ns.  As
> +   <literal>TIMING ON</literal> measures timestamps twice per row returned by
> +   an executor node, the overhead is very close to what pg_test_timing
> +   estimated it would be.
> +
> +   more than what pg_test_timing estimated it would be.  Even that relatively
> +   small amount of overhead is making the fully timed count statement take
> +   about 60% longer.  On more substantial queries, the timing overhead would
> +   be less problematic.

I guess you intend to merge these two paragraphs?



Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
Andres Freund
Date:
On 2023-01-20 22:50:37 -0600, Justin Pryzby wrote:
> On Fri, Jan 20, 2023 at 04:40:32PM -0800, Andres Freund wrote:
> > From 5a458d4584961dedd3f80a07d8faea66e57c5d94 Mon Sep 17 00:00:00 2001
> > From: Andres Freund <andres@anarazel.de>
> > Date: Mon, 16 Jan 2023 11:19:11 -0800
> > Subject: [PATCH v8 4/5] wip: report nanoseconds in pg_test_timing
> 
> >    <para>
> > -   The i7-860 system measured runs the count query in 9.8 ms while
> > -   the <command>EXPLAIN ANALYZE</command> version takes 16.6 ms, each
> > -   processing just over 100,000 rows.  That 6.8 ms difference means the timing
> > -   overhead per row is 68 ns, about twice what pg_test_timing estimated it
> > -   would be.  Even that relatively small amount of overhead is making the fully
> > -   timed count statement take almost 70% longer.  On more substantial queries,
> > -   the timing overhead would be less problematic.
> > +   The i9-9880H system measured shows an execution time of 4.116 ms for the
> > +   <literal>TIMING OFF</literal> query, and 6.965 ms for the
> > +   <literal>TIMING ON</literal>, each processing 100,000 rows.
> > +
> > +   That 2.849 ms difference means the timing overhead per row is 28 ns.  As
> > +   <literal>TIMING ON</literal> measures timestamps twice per row returned by
> > +   an executor node, the overhead is very close to what pg_test_timing
> > +   estimated it would be.
> > +
> > +   more than what pg_test_timing estimated it would be.  Even that relatively
> > +   small amount of overhead is making the fully timed count statement take
> > +   about 60% longer.  On more substantial queries, the timing overhead would
> > +   be less problematic.
> 
> I guess you intend to merge these two paragraphs ?

Oops. I was intending to drop the last paragraph.

Looking at the docs again I noticed that I needed to rephrase the 'acpi_pm'
section further, as I'd left the "a small multiple of what's measured directly
by this utility" language in there.

Do the changes otherwise make sense?

The "small multiple" stuff was just due to a) comparing "raw statement" with
explain analyze b) not accounting for two timestamps being taken per row.

I think it makes sense to remove the "jiffies" section - the output shown is
way outdated. And I don't think the jiffies time counter is something one
still sees in the wild, outside of bringing up a new cpu architecture or such.

Greetings,

Andres Freund



Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
Andres Freund
Date:
Hi,

On 2023-01-20 20:16:13 -0800, Andres Freund wrote:
> On 2023-01-18 14:05:35 +0100, David Geier wrote:
> > @Andres: will you take care of these changes and provide me with an updated
> > patch set so I can rebase the RDTSC changes?
> > Otherwise, I can also apply Tom suggestions to your patch set and send out
> > the complete patch set.
> 
> I'm planning to push most of my changes soon, had hoped to get to it a bit
> sooner, but ...

I pushed the int64-ification commits.


> If you have time to look at the pg_test_timing part, it'd be
> appreciated. That's a it larger, and nobody looked at it yet. So I'm a bit
> hesitant to push it.

I haven't yet pushed the pg_test_timing (nor it's small prerequisite)
patch.

Thanks to Justin I've polished the pg_test_timing docs some.


I've attached those two patches. Feel free to include them in your series if
you want, then the CF entry (and thus cfbot) makes sense again...

Greetings,

Andres Freund

Attachment

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
Andres Freund
Date:
Hi,

On 2023-01-20 21:31:57 -0800, Andres Freund wrote:
> On 2023-01-20 20:16:13 -0800, Andres Freund wrote:
> > I'm planning to push most of my changes soon, had hoped to get to it a bit
> > sooner, but ...
>
> I pushed the int64-ification commits.

There's an odd compilation failure on AIX.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly&dt=2023-01-21%2007%3A01%3A42

/opt/IBM/xlc/16.1.0/bin/xlc_r -D_LARGE_FILES=1 -DRANDOMIZE_ALLOCATED_MEMORY -qnoansialias -g -O2 -qmaxmem=33554432
-qsuppress=1500-010:1506-995 -qsuppress=1506-010:1506-416:1506-450:1506-480:1506-481:1506-492:1506-944:1506-1264
-qinfo=all:nocnd:noeff:noext:nogot:noini:noord:nopar:noppc:norea:nouni:nouse -qinfo=nounset  -qvisibility=hidden -I. -I.
-I/opt/freeware/include/python3.5m -I../../../src/include -I/home/nm/sw/nopath/icu58.3-64/include
-I/home/nm/sw/nopath/libxml2-64/include/libxml2 -I/home/nm/sw/nopath/uuid-64/include
-I/home/nm/sw/nopath/openldap-64/include -I/home/nm/sw/nopath/icu58.3-64/include -I/home/nm/sw/nopath/libxml2-64/include
-c -o plpy_cursorobject.o plpy_cursorobject.c
 
"../../../src/include/portability/instr_time.h", line 116.9: 1506-304 (I) No function prototype given for
"clock_gettime".
"../../../src/include/portability/instr_time.h", line 116.23: 1506-045 (S) Undeclared identifier CLOCK_REALTIME.
<builtin>: recipe for target 'plpy_cursorobject.o' failed

but files including instr_time.h *do* build successfully, e.g. instrument.c:

/opt/IBM/xlc/16.1.0/bin/xlc_r -D_LARGE_FILES=1 -DRANDOMIZE_ALLOCATED_MEMORY -qnoansialias -g -O2 -qmaxmem=33554432
-qsuppress=1500-010:1506-995 -qsuppress=1506-010:1506-416:1506-450:1506-480:1506-481:1506-492:1506-944:1506-1264
-qinfo=all:nocnd:noeff:noext:nogot:noini:noord:nopar:noppc:norea:nouni:nouse -qinfo=nounset -I../../../src/include
-I/home/nm/sw/nopath/icu58.3-64/include  -I/home/nm/sw/nopath/libxml2-64/include/libxml2
-I/home/nm/sw/nopath/uuid-64/include -I/home/nm/sw/nopath/openldap-64/include -I/home/nm/sw/nopath/icu58.3-64/include
-I/home/nm/sw/nopath/libxml2-64/include -c -o instrument.o instrument.c
 


Before the change the clock_gettime() call was in a macro and thus could be
referenced even without a prior declaration, as long as places using
INSTR_TIME_SET_CURRENT() had all the necessary includes and defines.


Argh:

There's a nice bit in plpython.h:

/*
 * Include order should be: postgres.h, other postgres headers, plpython.h,
 * other plpython headers.  (In practice, other plpython headers will also
 * include this file, so that they can compile standalone.)
 */
#ifndef POSTGRES_H
#error postgres.h must be included before plpython.h
#endif

/*
 * Undefine some things that get (re)defined in the Python headers. They aren't
 * used by the PL/Python code, and all PostgreSQL headers should be included
 * earlier, so this should be pretty safe.
 */
#undef _POSIX_C_SOURCE
#undef _XOPEN_SOURCE


the relevant stuff in time.h is indeed guarded by
#if _XOPEN_SOURCE>=500


I don't think the plpython code actually follows the rule about including all
postgres headers earlier.

plpy_typeio.h:

#include "access/htup.h"
#include "fmgr.h"
#include "plpython.h"
#include "utils/typcache.h"

plpy_cursorobject.c:

#include "access/xact.h"
#include "catalog/pg_type.h"
#include "mb/pg_wchar.h"
#include "plpy_cursorobject.h"
#include "plpy_elog.h"
#include "plpy_main.h"
#include "plpy_planobject.h"
#include "plpy_procedure.h"
#include "plpy_resultobject.h"
#include "plpy_spi.h"
#include "plpython.h"
#include "utils/memutils.h"


It strikes me as a uh, not good idea to undefine _POSIX_C_SOURCE,
_XOPEN_SOURCE.

The include order aspect was perhaps feasible when there just was plpython.c,
but with the split into many different C files and many headers, it seems hard
to maintain. There's a lot of violations afaics.

The undefines were added in a11cf433413, the split in 147c2482542.



Greetings,

Andres Freund



Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
David Geier
Date:
Hi,

On 1/21/23 05:14, Andres Freund wrote:
> The elapsed time is already inherently unstable, so we shouldn't have any test
> output showing the time.
>
> But I doubt showing it in every explain is a good idea - we use instr_time in
> plenty of other places. Why show it in explain, but not in all those other
> places?

Yeah. I thought it would only be an issue if we showed it 
unconditionally in EXPLAIN ANALYZE. If we only show it with TIMING ON, 
we're likely fine with pretty much all regression tests.

But given the different opinions, I'll leave it out in the new patch set 
for the time being.

-- 
David Geier
(ServiceNow)




Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
Andres Freund
Date:
Hi,

On 2023-01-23 18:23:17 +0100, David Geier wrote:
> On 1/21/23 05:14, Andres Freund wrote:
> > The elapsed time is already inherently unstable, so we shouldn't have any test
> > output showing the time.
> > 
> > But I doubt showing it in every explain is a good idea - we use instr_time in
> > plenty of other places. Why show it in explain, but not in all those other
> > places?
> 
> Yeah. I thought it would only be an issue if we showed it unconditionally in
> EXPLAIN ANALYZE. If we only show it with TIMING ON, we're likely fine with
> pretty much all regression tests.

If we add it, it probably shouldn't depend on TIMING, but on
SUMMARY. Regression test queries showing EXPLAIN ANALYZE output all do
something like
  EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)

the SUMMARY OFF gets rid of the "top-level" "Planning Time" and "Execution
Time", whereas the TIMING OFF gets rid of the per-node timing. Those are
separate options because per-node timing is problematic performance-wise
(right now), but whole-query timing rarely is.


> But given the different opinions, I'll leave it out in the new patch set for
> the moment being.

Makes sense.


Another, independent, thing worth thinking about: I think we might want to
expose both rdtsc and rdtscp. For something like
InstrStartNode()/InstrStopNode(), avoiding the "one-way barrier" of rdtscp is
quite important to avoid changing the query performance. But for measuring
whole-query time, we likely want to measure the actual time.

It probably won't matter hugely for the whole query time - the out of order
window of modern CPUs is large, but not *that* large - but I don't think we
can generally assume that.

I'm thinking of something like INSTR_TIME_SET_CURRENT() and
INSTR_TIME_SET_CURRENT_FAST() or _NOBARRIER().
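
A minimal sketch of the two variants, assuming x86-64 and the compiler
intrinsics (helper names made up for illustration):

#include <x86intrin.h>

static inline uint64
read_tsc_nobarrier(void)
{
    /* rdtsc: not ordered against earlier instructions, cheapest option;
     * what INSTR_TIME_SET_CURRENT_FAST() / _NOBARRIER() would use */
    return __rdtsc();
}

static inline uint64
read_tsc_barrier(void)
{
    unsigned int aux;

    /* rdtscp: waits for prior instructions to finish before reading the
     * counter; what INSTR_TIME_SET_CURRENT() would use for whole-query time */
    return __rdtscp(&aux);
}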

Greetings,

Andres Freund



Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
David Geier
Date:
Hi,

On 1/21/23 05:12, Andres Freund wrote:
> We do currently do the conversion quite frequently.  Admittedly I was
> partially motivated by trying to get the per-loop overhead in pg_test_timing
> down ;)
>
> But I think it's a real issue. Places where we do, but shouldn't, convert:
>
> - ExecReScan() - quite painful, we can end up with a lot of those
> - InstrStopNode() - adds a good bit of overhead to simple
InstrStopNode() doesn't convert in the general case but only for the 
first tuple or when async. So it goes somewhat hand in hand with 
ExecReScan().
> - PendingWalStats.wal_write_time - this is particularly bad because it happens
>    within very contended code
> - calls to pgstat_count_buffer_read_time(), pgstat_count_buffer_write_time() -
>    they can be very frequent
> - pgbench.c, as we already discussed
> - pg_stat_statements.c
> - ...
>
> These all will get a bit slower when moving to a "variable" frequency.
I wonder if we will be able to measure any of them easily. But given 
that it's many more places than I had realized and given that the 
optimized code is not too involved, let's give it a try.
> What was your approach for avoiding the costly operation?  I ended up with a
> integer multiplication + shift approximation for the floating point
> multiplication (which in turn uses the inverse of the division by the
> frequency). To allow for sufficient precision while also avoiding overflows, I
> had to make that branch conditional, with a slow path for large numbers of
> nanoseconds.

It seems like we ended up with the same. I do:

sec = ticks / frequency_hz
ns  = ticks / frequency_hz * 1,000,000,000
ns  = ticks * (1,000,000,000 / frequency_hz)
ns  = ticks * (1,000,000 / frequency_khz) <-- now in kilohertz

Now, the constant scaling factor in parentheses is typically a floating
point number. For example, for a frequency of 2.5 GHz it would be 0.4. To
work around that we can do something like:

ns  = ticks * (1,000,000 * scaler / frequency_khz) / scaler

Where scaler is a power-of-2, big enough to maintain enough precision 
while allowing for a shift to implement the division.

The additional multiplication with scaler makes the maximum range go
down, because we must ensure we never overflow. I'm wondering if we
cannot pick scaler in such a way that the remaining range of cycles is
large enough for our use case, so we can live without bothering about
the overflow case. What would be "enough"? 1 year? 10 years? ...

Otherwise, we indeed need code that handles the potential overflow. My
hunch is that it can be done branchless, but that for sure adds dependent
instructions. Maybe in that case a branch that will almost certainly
never be taken is better?
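
In rough, hypothetical code (the constants would be computed once at
startup from the measured TSC frequency; names are made up):

static double    ns_per_tick;        /* e.g. 0.4 for a 2.5GHz TSC */
static int64     ticks_to_ns_mul;    /* round(ns_per_tick * 2^shift) */
static int       ticks_to_ns_shift;  /* the power-of-2 "scaler" from above */
static int64     max_fast_ticks;     /* largest count that cannot overflow */

static inline int64
ticks_to_nanoseconds(int64 ticks)
{
    /* common case: integer multiply + shift */
    if (likely(ticks < max_fast_ticks))
        return (ticks * ticks_to_ns_mul) >> ticks_to_ns_shift;

    /* rare slow path for huge intervals: fall back to floating point */
    return (int64) (ticks * ns_per_tick);
}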

I'll include the code in the new patch set, which I'll submit tomorrow
at the latest.

> I think it'd be great - but I'm not sure we're there yet, reliability and
> code-complexity wise.
Thanks to your commits, the diff of the new patch set will be already 
much smaller and easier to review. What's your biggest concern in terms 
of reliability?
> I think it might be worth makign the rdts aspect somewhat
> measurable. E.g. allowing pg_test_timing to use both at the same time, and
> have it compare elapsed time with both sources of counters.
I haven't yet looked into pg_test_timing. I'll do that while including 
your patches into the new patch set.

-- 
David Geier
(ServiceNow)




Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
David Geier
Date:
Hi,

On 1/21/23 06:31, Andres Freund wrote:
> I pushed the int64-ification commits.

Great. I started rebasing.

One thing I was wondering about: why did you choose to use a signed 
instead of an unsigned 64-bit integer for the ticks?
>> If you have time to look at the pg_test_timing part, it'd be
>> appreciated. That's a it larger, and nobody looked at it yet. So I'm a bit
>> hesitant to push it.
> I haven't yet pushed the pg_test_timing (nor it's small prerequisite)
> patch.
>
> I've attached those two patches. Feel free to include them in your series if
> you want, then the CF entry (and thus cfbot) makes sense again...
I'll include them in my new patch set and also have a careful look at them.

-- 
David Geier
(ServiceNow)




Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
Andres Freund
Date:
Hi,

On 2023-01-23 18:49:37 +0100, David Geier wrote:
> On 1/21/23 05:12, Andres Freund wrote:
> > We do currently do the conversion quite frequently.  Admittedly I was
> > partially motivated by trying to get the per-loop overhead in pg_test_timing
> > down ;)
> > 
> > But I think it's a real issue. Places where we do, but shouldn't, convert:
> > 
> > - ExecReScan() - quite painful, we can end up with a lot of those
> > - InstrStopNode() - adds a good bit of overhead to simple
> InstrStopNode() doesn't convert in the general case but only for the first
> tuple or when async. So it goes somewhat hand in hand with ExecReScan().

I think even the first-scan portion is likely noticeable for quick queries -
you can quickly end up with 5-10 nodes, even for queries processed in the <
0.1ms range.

Of course it's way worse with rescans / loops.


> > - PendingWalStats.wal_write_time - this is particularly bad because it happens
> >    within very contended code
> > - calls to pgstat_count_buffer_read_time(), pgstat_count_buffer_write_time() -
> >    they can be very frequent
> > - pgbench.c, as we already discussed
> > - pg_stat_statements.c
> > - ...
> > 
> > These all will get a bit slower when moving to a "variable" frequency.

> I wonder if we will be able to measure any of them easily. But given that
> it's many more places than I had realized and given that the optimized code
> is not too involved, let's give it a try.

I think at least some should be converted to just accumulate in an
instr_time...



> > What was your approach for avoiding the costly operation?  I ended up with a
> > integer multiplication + shift approximation for the floating point
> > multiplication (which in turn uses the inverse of the division by the
> > frequency). To allow for sufficient precision while also avoiding overflows, I
> > had to make that branch conditional, with a slow path for large numbers of
> > nanoseconds.
> 
> It seems like we ended up with the same. I do:
> 
> sec = ticks / frequency_hz
> ns  = ticks / frequency_hz * 1,000,000,000
> ns  = ticks * (1,000,000,000 / frequency_hz)
> ns  = ticks * (1,000,000 / frequency_khz) <-- now in kilohertz
> 
> Now, the constant scaling factor in parentheses is typically a floating
> point number. For example for a frequency of 2.5 GHz it would be 2.5. To
> work around that we can do something like:
> 
> ns  = ticks * (1,000,000 * scaler / frequency_khz) / scaler
> 
> Where scaler is a power-of-2, big enough to maintain enough precision while
> allowing for a shift to implement the division.

Yep, at least quite similar.


> The additional multiplication with scaler makes that the maximum range go
> down, because we must ensure we never overflow. I'm wondering if we cannot
> pick scaler in such a way that remaining range of cycles is large enough for
> our use case and we can therefore live without bothering for the overflow
> case. What would be "enough"? 1 year? 10 years? ...

Depending on how low we want to keep the error, I don't think we can:

If I set the allowed deviation to 10**-9, we end up requiring a shift by 29
for common ghz ranges. Clearly 33bits isn't an interesting range.

But even if you accept a higher error - we don't have *that* much range
available. Assuming a uint64, the range is ~584 years. If we want 10 years
range, we end up

  math.log(((2**64)-1) / (10 * 365 * 60 * 60 * 24 * 10**9), 2)
  ~= 5.87

So 5 bits available that we could "use" for multiply/shift. For something like
2.5ghz, that'd be ~2% error, clearly not acceptable.  And even just a year of
range ends up allowing an error of 30796s (~8.5 hours) over a year, still too
high.


But I don't think it's really an issue - normally that branch will never be
taken (at least within the memory of the branch predictor), which on modern
CPUs means it'll just be predicted as not taken. So as long as we tell the
compiler what's the likely branch, it should be fine. At least as long as the
branch compares with a hardcoded number.


> > I think it'd be great - but I'm not sure we're there yet, reliability and
> > code-complexity wise.

> Thanks to your commits, the diff of the new patch set will be already much
> smaller and easier to review. What's your biggest concern in terms of
> reliability?

- the restriction just to linux, that'll make testing harder for some, and
  ends up encoding too much OS dependency
- I think we need both the barrier and non-barrier variant, otherwise I
  suspect we'll end up with inaccuracies we don't want
- needs lots more documentation about why certain cpuid registers are used
- cpu microarch dependencies - isn't there, e.g., the case that the scale on
  nehalem has to be different than on later architectures?
- lack of facility to evaluate how well the different time sources work


> > I think it might be worth makign the rdts aspect somewhat
> > measurable. E.g. allowing pg_test_timing to use both at the same time, and
> > have it compare elapsed time with both sources of counters.
> I haven't yet looked into pg_test_timing. I'll do that while including your
> patches into the new patch set.

Cool.

Greetings,

Andres Freund



Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
Andres Freund
Date:
Hi,

On 2023-01-23 18:52:44 +0100, David Geier wrote:
> One thing I was wondering about: why did you chose to use a signed instead
> of an unsigned 64-bit integer for the ticks?

That's been the case since my first post in the thread :). Mainly, it seems
easier to detect underflow cases during subtraction that way. And the factor
of 2 in range doesn't change a whole lot.


> > > If you have time to look at the pg_test_timing part, it'd be
> > > appreciated. That's a it larger, and nobody looked at it yet. So I'm a bit
> > > hesitant to push it.
> > I haven't yet pushed the pg_test_timing (nor it's small prerequisite)
> > patch.
> > 
> > I've attached those two patches. Feel free to include them in your series if
> > you want, then the CF entry (and thus cfbot) makes sense again...
> I'll include them in my new patch set and also have a careful look at them.

Thanks.

Greetings,

Andres Freund



Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
David Geier
Date:
Hi,

On 1/23/23 18:41, Andres Freund wrote:
> If we add it, it probably shouldn't depend on TIMING, but on
> SUMMARY. Regression test queries showing EXPLAIN ANALYZE output all do
> something like
>    EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
>
> the SUMMARY OFF gets rid of the "top-level" "Planning Time" and "Execution
> Time", whereas the TIMING OFF gets rid of the per-node timing. Those are
> separate options because per-node timing is problematic performance-wise
> (right now), but whole-query timing rarely is.
Makes sense. I wasn't aware of SUMMARY. Let's keep this option in mind, 
in case we revisit exposing the clock source in the future.
> Another, independent, thing worth thinking about: I think we might want to
> expose both rdtsc and rdtscp. For something like
> InstrStartNode()/InstrStopNode(), avoiding the "one-way barrier" of rdtscp is
> quite important to avoid changing the query performance. But for measuring
> whole-query time, we likely want to measure the actual time.
>
> It probably won't matter hugely for the whole query time - the out of order
> window of modern CPUs is large, but not *that* large - but I don't think we
> can generally assume that.

That's what I thought as well. I added INSTR_TIME_SET_CURRENT_FAST() and
for now call that variant from InstrStartNode(), InstrStopNode() and
pg_test_timing. To do so in InstrStopNode(), I removed
INSTR_TIME_SET_CURRENT_LAZY(). Otherwise, two variants of that macro
would be needed. INSTR_TIME_SET_CURRENT_LAZY() was only used in a single
place and the code is more readable that way. INSTR_TIME_SET_CURRENT()
is called from a bunch of places. I still have to go through all of them
and see which should be changed to call the _FAST() variant.

Attached is v7 of the patch:

- Rebased on latest master (most importantly on top of the int64
  instr_time commits).
- Includes two commits from Andres which introduce INSTR_TIME_SET_SECONDS(),
  INSTR_TIME_IS_LT() and WIP to report pg_test_timing output in nanoseconds.
- Converts ticks to nanoseconds only with integer math, while accounting
  for overflow.
- Supports RDTSCP via INSTR_TIME_SET_CURRENT() and introduces
  INSTR_TIME_SET_CURRENT_FAST(), which uses RDTSC.

I haven't gotten to the following:

- Looking through all calls to INSTR_TIME_SET_CURRENT() and checking if
  they should be replaced by INSTR_TIME_SET_CURRENT_FAST().
- Reviewing Andres' commits. Potentially improving on pg_test_timing's output.
- Looking at enabling RDTSC on more platforms. Is there a minimum set of
  platforms we would like support for? Windows should be easy. That would
  also allow unifying the code a little more.
- Adding more documentation and doing more testing around the calls to CPUID.
- Profiling and optimizing the code. A quick test showed about 10% improvement
  over master with TIMING ON vs TIMING OFF, when using the test-case from
  Andres' e-mail that started this thread.

I hope I'll find time to work on these points during the next few days.

-- 
David Geier
(ServiceNow)

Attachment

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
David Geier
Date:
Hi
>
> I think at least some should be converted to just accumulate in an
> instr_time...
I think that's for a later patch though?
> Yep, at least quite similar.
OK. I coded it up in the latest version of the patch.
> Depending on how low we want to keep the error, I don't think we can:
>
> If I set the allowed deviation to 10**-9, we end up requiring a shift by 29
> for common ghz ranges. Clearly 33bits isn't an interesting range.
>
> But even if you accept a higher error - we don't have *that* much range
> available. Assuming an uint64, the range is ~584 years. If we want 10 years
> range, we end up
>
>    math.log(((2**64)-1) / (10 * 365 * 60 * 60 * 24 * 10**9), 2)
>    ~= 5.87
>
> So 5 bits available that we could "use" for multiply/shift. For something like
> 2.5ghz, that'd be ~2% error, clearly not acceptable.  And even just a year of
> range, ends up allowing a failure of 30796s = 8min over a year, still too
> high.
Thanks for doing the math. Agreed. The latest patch detects overflow and 
correctly handles it.
> But I don't think it's really an issue - normally that branch will never be
> taken (at least within the memory of the branch predictor), which on modern
> CPUs means it'll just be predicted as not taken. So as long as we tell the
> compiler what's the likely branch, it should be fine. At least as long as the
> branch compares with a hardcoded number.
Yeah. The overflow detection just compares two int64. The "overflow 
threshold" is pre-computed.
> - the restriction just to linux, that'll make testing harder for some, and
>    ends up encoding too much OS dependency
> - I think we need both the barrier and non-barrier variant, otherwise I
>    suspect we'll end up with inccuracies we don't want
> - needs lots more documentation about why certain cpuid registers are used
> - cpu microarch dependencies - isn't there, e.g., the case that the scale on
>    nehalem has to be different than on later architectures?
> - lack of facility to evaluate how well the different time sources work
Makes sense. I carried that list over to my latest e-mail (which also
includes the patch), so there's some sort of summary of where we are in
a single place.

-- 
David Geier
(ServiceNow)




Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
David Geier
Date:
Hi,

On 1/23/23 21:30, Andres Freund wrote:
> That's been the case since my first post in the thread :). Mainly, it seems
> easier to detect underflow cases during subtraction that way. And the factor
> of 2 in range doesn't change a whole lot.
I just realized it the other day :).
>>>> If you have time to look at the pg_test_timing part, it'd be
>>>> appreciated. That's a it larger, and nobody looked at it yet. So I'm a bit
>>>> hesitant to push it.
>>> I haven't yet pushed the pg_test_timing (nor it's small prerequisite)
>>> patch.
>>>
>>> I've attached those two patches. Feel free to include them in your series if
>>> you want, then the CF entry (and thus cfbot) makes sense again...
>> I'll include them in my new patch set and also have a careful look at them.

I reviewed the prerequisite patch which introduces 
INSTR_TIME_SET_SECONDS(), as well as the pg_test_timing patch. Here are
my comments:

- The prerequisite patch looks good me.

- By default, the test query in the pg_test_timing doc runs serially. 
What about adding SET max_parallel_workers_per_gather = 0 to make sure 
it really always does (e.g. on a system with different settings for 
parallel_tuple_cost / parallel_setup_cost)? Otherwise, the numbers will 
be much more flaky.

- Why have you added a case distinction for diff == 0? Have you 
encountered this case? If so, how? Maybe add a comment.

- To further reduce overhead we could call INSTR_TIME_SET_CURRENT() 
multiple times. But then again: why do we actually care about the 
per-loop time? Why not instead sum up diff and divide by the number of 
iterations to exclude all the overhead in the first place?

- In the computation of the per-loop time in nanoseconds you can now use 
INSTR_TIME_GET_NANOSEC() instead of INSTR_TIME_GET_DOUBLE() * NS_PER_S.

The rest looks good to me. The rebased patches are part of the patch set 
I sent out yesterday in reply to another mail in this thread.

-- 
David Geier
(ServiceNow)




Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
Andres Freund
Date:
Hi,

On 2023-01-24 14:30:34 +0100, David Geier wrote:
> Attached is v7 of the patch:
> 
> - Rebased on latest master (most importantly on top of the int64 instr_time
> commits). - Includes two commits from Andres which introduce
> INSTR_TIME_SET_SECONDS(), INSTR_TIME_IS_LT() and WIP to report
> pg_test_timing output in nanoseconds. - Converts ticks to nanoseconds only
> with integer math, while accounting for overflow. - Supports RDTSCP via
> INSTR_TIME_SET_CURRENT() and introduced INSTR_TIME_SET_CURRENT_FAST() which
> uses RDTSC.
> 
> I haven't gotten to the following:
> 
> - Looking through all calls to INSTR_TIME_SET_CURRENT() and check if they
> should be replaced by INSTR_TIME_SET_CURRENT_FAST(). - Reviewing Andres
> commits. Potentially improving on pg_test_timing's output. - Looking at
> enabling RDTSC on more platforms. Is there a minimum set of platforms we
> would like support for? Windows should be easy. That would also allow to
> unify the code a little more. - Add more documentation and do more testing
> around the calls to CPUID. - Profiling and optimizing the code. A quick test
> showed about 10% improvement over master with TIMING ON vs TIMING OFF, when
> using the test-case from Andres' e-mail that started this thread.
> 
> I hope I'll find time to work on these points during the next days.

This fails to build on several platforms:

https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F42%2F3751

Greetings,

Andres Freund



Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
David Geier
Date:
Hi,

On 2/7/23 19:12, Andres Freund wrote:
> This fails to build on several platforms:
>
> https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F42%2F3751

I think I fixed the compilation errors. They were due to a few variables
being declared under

#if defined(__x86_64__) && defined(__linux__)

while also being used on non-x86 Linux.

I also removed the code to obtain the TSC frequency under hypervisors
again, because the TSC is usually emulated there and therefore no faster
than clock_gettime() anyway. So we now simply fall back to
clock_gettime() on hypervisors when we cannot obtain the frequency via
leaf 0x16.
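
For reference, the leaf 0x16 lookup is roughly the following (simplified
sketch for GCC/Clang on x86-64; the reported base frequency is used as an
approximation of the TSC frequency):

#include <cpuid.h>

static bool
tsc_freq_khz_from_cpuid(uint64 *freq_khz)
{
    unsigned int eax, ebx, ecx, edx;

    if (__get_cpuid_max(0, NULL) < 0x16)
        return false;            /* leaf 0x16 not supported */

    __cpuid(0x16, eax, ebx, ecx, edx);
    if (eax == 0)
        return false;            /* no base frequency reported */

    *freq_khz = (uint64) eax * 1000;    /* EAX is the base frequency in MHz */
    return true;
}

If that returns false, we keep using clock_gettime().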

Beyond that I reviewed the first two patches a while ago in [1]. I hope 
we can progress with them to further reduce the size of this patch set.

[1] 
https://www.postgresql.org/message-id/3ac157f7-085d-e071-45fc-b87cd306360c%40gmail.com 


-- 
David Geier
(ServiceNow)

Attachment

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
David Geier
Date:
Hi!

On 2/14/23 12:11, David Geier wrote:
> Hi,
>
> I think I fixed the compilation errors. It was due to a few variables 
> being declared under
>
> #if defined(__x86_64__) && defined(__linux__)
>
> while being used also under non x86 Linux.
>
> I also removed again the code to obtain the TSC frequency under 
> hypervisors because the TSC is usually emulated and therefore no 
> faster than clock_gettime() anyways. So we now simply fallback to 
> clock_gettime() on hypervisors when we cannot obtain the frequency via 
> leaf 0x16.
>
> Beyond that I reviewed the first two patches a while ago in [1]. I 
> hope we can progress with them to further reduce the size of this 
> patch set.
>
> [1] 
> https://www.postgresql.org/message-id/3ac157f7-085d-e071-45fc-b87cd306360c%40gmail.com 
>
>
It still fails.

I'll get Cirrus-CI working on my own Github fork so I can make sure it 
really compiles on all platforms before I submit a new version.

-- 
David Geier
(ServiceNow)




Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
David Geier
Date:
Hi!

On 2/14/23 13:48, David Geier wrote:
>
> It still fails.
>
> I'll get Cirrus-CI working on my own Github fork so I can make sure it 
> really compiles on all platforms before I submit a new version.

It took some time until Cirrus CI allowed me to run tests against my new
GitHub account (there's a 3-day freeze to prevent people from using
Cirrus CI nodes to mine bitcoins :-D). Attached is the latest patch,
which now passes the builds, rebased on latest master.

I also reviewed the first two patches a while ago in [1]. I hope we can 
progress with them to further reduce the size of this patch set.

Beyond that: I could work on support for more OSs (e.g. starting with 
Windows). Is there appetite for that, or would we rather start with a
smaller patch?

[1] 
https://www.postgresql.org/message-id/3ac157f7-085d-e071-45fc-b87cd306360c%40gmail.com

-- 
David Geier
(ServiceNow)

Attachment

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
vignesh C
Date:
On Mon, 20 Feb 2023 at 16:06, David Geier <geidav.pg@gmail.com> wrote:
>
> Hi!
>
> On 2/14/23 13:48, David Geier wrote:
> >
> > It still fails.
> >
> > I'll get Cirrus-CI working on my own Github fork so I can make sure it
> > really compiles on all platforms before I submit a new version.
>
> It took some time until Cirrus CI allowed me to run tests against my new
> GitHub account (there's a 3 days freeze to avoid people from getting
> Cirrus CI nodes to mine bitcoins :-D). Attached now the latest patch
> which passes builds, rebased on latest master.
>
> I also reviewed the first two patches a while ago in [1]. I hope we can
> progress with them to further reduce the size of this patch set.
>
> Beyond that: I could work on support for more OSs (e.g. starting with
> Windows). Is there appetite for that or do we rather want to instead
> start with a smaller patch?

Are we planning to continue with this and take it further?
I see that there has been no activity in this thread for nearly a year
now; I'm planning to close this in the current commitfest unless someone
is planning to take it forward.

Regards,
Vignesh



Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
vignesh C
Date:
On Sat, 20 Jan 2024 at 09:03, vignesh C <vignesh21@gmail.com> wrote:
>
> On Mon, 20 Feb 2023 at 16:06, David Geier <geidav.pg@gmail.com> wrote:
> >
> > Hi!
> >
> > On 2/14/23 13:48, David Geier wrote:
> > >
> > > It still fails.
> > >
> > > I'll get Cirrus-CI working on my own Github fork so I can make sure it
> > > really compiles on all platforms before I submit a new version.
> >
> > It took some time until Cirrus CI allowed me to run tests against my new
> > GitHub account (there's a 3 days freeze to avoid people from getting
> > Cirrus CI nodes to mine bitcoins :-D). Attached now the latest patch
> > which passes builds, rebased on latest master.
> >
> > I also reviewed the first two patches a while ago in [1]. I hope we can
> > progress with them to further reduce the size of this patch set.
> >
> > Beyond that: I could work on support for more OSs (e.g. starting with
> > Windows). Is there appetite for that or do we rather want to instead
> > start with a smaller patch?
>
> Are we planning to continue on this and take it further?
> I'm seeing that there has been no activity in this thread for nearly 1
> year now, I'm planning to close this in the current commitfest unless
> someone is planning to take it forward.

Since neither the author nor anyone else showed interest in taking it
forward and the patch has had no activity for more than a year, I have
changed the status to RWF. Feel free to add a new CF entry when someone
plans to resume work on it more actively.

Regards,
Vignesh



Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From
Andres Freund
Date:
Hi,

At some point this patch switched from rdtsc to rdtscp, which imo largely
negates the point of it. What led to that?

Greetings,

Andres Freund