Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc? - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Date
Msg-id zwv2ggywiz23vghehofkvsrunlmrzc2zbrohd6i4j6a53meb4l@3vl36n4tbxvp
In response to Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?  (Lukas Fittl <lukas@fittl.com>)
Responses Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
List pgsql-hackers
Hi,

On 2026-04-06 04:25:36 -0700, Lukas Fittl wrote:
> On Sun, Apr 5, 2026 at 10:15 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > - tsc_use_by_default() may be documenting things that aren't the case anymore
> >
>
> I don't think there is a correctness issue, unless you mean the fact
> that we're not doing the < 8 socket check anymore that is one of the
> things mentioned on the LKML posts referenced. I think the LKML post
> references are still helpful as evidence why we trust that Intel has
> reliable TSC handling.

Well:
    "Mirrors the Linux kernel's clocksource watchdog disable logic"
doesn't seem quite right, given that in that place we are just checking
TSC_ADJUST and we don't have the < 8 socket check.

I'd probably say something like 'inspired by ... ' and mention that the rest
of the check is in tsc_detect_frequency().


I wonder if the cpuid tests should be a bit further abstracted into
pg_cpu_x86.c.

E.g. instead of tsc_detect_frequency() checking for PG_RDTSCP,
PG_TSC_INVARIANT, PG_TSC_ADJUST we could have

PG_TSC_AVAILABLE /* RDTSCP & INVARIANT */
PG_TSC_KNOWN_RELIABLE /* PG_TSC_AVAILABLE && PG_TSC_ADJUST */
PG_TSC_FREQUENCY_KNOWN /* x86_tsc_frequency_khz works */

and always run all of that during set_x86_features().


> I did note a bit of a grammar oddity at the end of the comment, fixed that.
>
> > - It may be paranoia, but it seems like tsc_calibrate() should perhaps save
> >   the old clock source and restore it at the end?
>
> Sure, seems reasonable. Done.

Cool.


> > - Should pg_initialize_timing() allow repeated initialization? Seems like that
> >   would normally be a bug?
>
> I'm trying to recall if restore_backend_variables might have a problem
> if we didn't allow that? (since we call pg_initialize_timing there,
> which I think is due to ordering)

Yea, that'd make it problematic.



> > - It's nice that pg_test_timing shows the frequency.  I was thinking it'd be
> >   nice if it were able to show the result of the calibration, even if we were
> >   able to determine the frequency without calibration.  That should make it
> >   easier to figure out whether the calibration works well.
>
> Added. I've renamed "tsc_calibrate" to "pg_tsc_calibrate_frequency"
> and exported that, to support that.

Nice.

Workstation idle:
TSC frequency in use: 2500000 kHz
TSC frequency from calibration: 2499519 kHz

Busy:
TSC frequency in use: 2500000 kHz
TSC frequency from calibration: 2499262 kHz

Completely overwhelmed (load >1200):
TSC frequency in use: 2500000 kHz
TSC frequency from calibration: 2499405 kHz

That's very much good enough.


> FWIW, this now calibrates twice in pg_test_timing if we're on a system
> that has to use calibration. If we wanted to avoid that, we could
> introduce some kind of flag that indicated the TSC frequency was
> already determined through calibration. Not sure if needed?

I have no concern whatsoever with doing it twice in pg_test_timing.


> > - Wonder if some of the code would look a bit cleaner if timing_tsc_enabled,
> >   timing_tsc_frequency_khz were defined regardless of PG_INSTR_TSC_CLOCK.
>
> Yeah, I don't see harm in defining them always, and its easier on the
> eyes. Done. Likewise, I've also made the timing_tsc_frequency_khz in
> BackendParameters defined always.

Nice.

One thing this reminded me of is that pg_set_timing_clock_source() does:

bool
pg_set_timing_clock_source(TimingClockSourceType source)
{
    Assert(timing_initialized);

#if PG_INSTR_TSC_CLOCK
    pg_initialize_timing_tsc();

    switch (source)
    {
        case TIMING_CLOCK_SOURCE_AUTO:
            timing_tsc_enabled = (timing_tsc_frequency_khz > 0) && tsc_use_by_default();
            break;
        case TIMING_CLOCK_SOURCE_SYSTEM:
            timing_tsc_enabled = false;
            break;
        case TIMING_CLOCK_SOURCE_TSC:
            /* Tell caller TSC is not usable */
            if (timing_tsc_frequency_khz <= 0)
                return false;
            timing_tsc_enabled = true;
            break;
    }
#endif

    set_ticks_per_ns();
    timing_clock_source = source;
    return true;
}


Which means that if building without PG_INSTR_TSC_CLOCK and called with
TIMING_CLOCK_SOURCE_TSC, it'd return true despite having done something bogus.


I was also wondering if there's an argument for moving the
pg_initialize_timing_tsc() into the relevant switch() cases, so the
calibration doesn't run if configured with TIMING_CLOCK_SOURCE_SYSTEM. But I
think because during GUC initialization we'll be called with the builtin
value, that wouldn't change anything?


> >
> > - is it ok that we are doing pg_cpuid_subleaf() without checking the result?
> >
> >   It's not clear to me if a failed __get_cpuid_count() would clear out the old
> >   reg or leave it in place.
>
> Hm, maybe best if we just memset reg in pg_cpuid_subleaf
> unconditionally, before calling __get_cpuid_count / __cpuidex.

Yea, I think that'd be safer.


> > - How much do we care about weird results when dynamically changing
> >   timing_clocksource?
> >
> > postgres[182055][1]=# EXPLAIN ANALYZE SELECT set_config('timing_clock_source', 'tsc', true), pg_sleep(1), set_config('timing_clock_source','system', true), pg_sleep(1);
> > ┌────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
> > │                                                 QUERY PLAN                                                 │
> > ├────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
> > │ Result  (cost=0.00..0.01 rows=1 width=72) (actual time=-6540570569.396..-6540570569.395 rows=1.00 loops=1) │
> > │ Planning Time: 0.184 ms                                                                                    │
> > │ Execution Time: -6540570569.355 ms                                                                         │
> > └────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
> > (3 rows)
> >
> > Time: 2002.350 ms (00:02.002)
>
> That was brought up earlier in the thread as well, and I added a code
> comment in response.

Should there be a comment in the docs about it?


> I think the trade-off here is that if we make this a more restrictive
> GUC level (the main solution I can think of), we take away the ability
> for users to confirm whether the new timing logic caused their timings
> to be inaccurate.

Yea, it is very useful.  I guess an in-between could be to make it SUSET.

Is there an argument that a user could hide the cost of their queries from
things like pg_stat_statements and that therefore it should be SUSET?


> And it seems very unlikely that someone would actually change the GUC within
> a query (or within a function).

I agree.  I think it's worth mentioning and worth thinking about whether it
needs to be SUSET (re the question above), but I don't see an argument for
making it PGC_SIGHUP or such.

Not for now, but I think it'd be nice if the GUC framework had a way of
expressing that some settings can only be changed at the top-level.


> >   - 'tsc' describes just x86-64, even if there is a patch to support aarch64.
> >     Perhaps it'd be enough to sprinkle a few "E.g. on x86-64, ..." around.
>
> Hmm. I'm not sure how we can improve that really by adding "E.g."
> somewhere, but maybe I don't follow.

+            <literal>tsc</literal> (measures timing using the x86-64 Time-Stamp Counter (TSC)
+            by directly executing RDTSC/RDTSCP instructions, see below)

If that instead is something like 'tsc' (measures timing with a CPU
instruction, e.g. using RDTSC/RDTSCP on x86-64) it would not be wrong even
after adding aarch64 support.



> What I could see us doing is explicitly calling out that TSC is not
> supported on other architectures?

Yea, I think it'd be good to mention that.



> Subject: [PATCH v20 1/5] instrumentation: Streamline ticks to nanosecond
>  conversion across platforms

> +static inline int64
> +pg_ticks_to_ns(int64 ticks)
>  {
> -    LARGE_INTEGER f;
> +#if PG_INSTR_TICKS_TO_NS
> +    int64        ns = 0;
> +
> +    Assert(timing_initialized);
> +
> +    /*
> +     * Avoid doing work if we don't use scaled ticks, e.g. system clock on
> +     * Unix
> +     */

Maybe add something like "(in that case ticks is counted in nanoseconds)"?


Leaving aside that I don't think it makes sense to push this without also
pushing 0002/0003, I think this is ready.



> Subject: [PATCH v20 2/5] Allow retrieving x86 TSC frequency/flags from CPUID
>
> This adds additional x86 specific CPUID checks for flags needed for
> determining whether the Time-Stamp Counter (TSC) is usable on a given
> system, as well as a helper function to retrieve the TSC frequency from
> CPUID.
>
> This is intended for a future patch that will utilize the TSC to lower
> the overhead of timing instrumentation.
>
> In passing, always make pg_cpuid_subleaf reset the variables used for its
> result, to avoid accidentally using stale results if __get_cpuid_count
> errors out.



> +/*
> + * Determine the TSC frequency of the CPU through CPUID, where supported.
> + *
> + * Needed to interpret the tick value returned by RDTSC/RDTSCP. Return value of
> + * 0 indicates the frequency information was not accessible via CPUID.
> + */
> +uint32
> +x86_tsc_frequency_khz(void)
> +{
> +    unsigned int reg[4] = {0};
> +
> +    if (x86_feature_available(PG_HYPERVISOR))
> +        return x86_hypervisor_tsc_frequency_khz();


Is there a point in checking whether the things below are present if the
hypervisor-specific logic doesn't find a freq?  I think it can be configured
to be passed through on some hypervisor / cpu combinations.



I think this is also close to ready, except for the minor details I raised at
the start and just here.


> From c58ea726f9c1a89eed46ee45a182457cea737d79 Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <lukas@fittl.com>
> Date: Thu, 2 Apr 2026 13:17:11 -0700
> Subject: [PATCH v20 3/5] instrumentation: Use Time-Stamp Counter (TSC) on
>  x86-64 for faster measurements
>
> This allows the direct use of the Time-Stamp Counter (TSC) value retrieved
> from the CPU using RDTSC/RDTSC instructions, instead of APIs like

Missing P at the end of RDTSC/RDTSC.


> clock_gettime() on POSIX systems. This reduces the overhead of EXPLAIN with
> ANALYZE and TIMING ON. Tests showed that runtime when instrumented can be
> reduced by up to 10% for queries moving lots of rows through the plan.

FWIW, I see considerably bigger gains in some cases. Mostly queries with many
query "levels". But even some simple ones:


Baseline:

SELECT * FROM pgbench_accounts LIMIT 1 OFFSET 10000000;
\timing reports 322.548 ms

Baseline with EXPLAIN ANALYZE overhead:


EXPLAIN (ANALYZE, BUFFERS 0, TIMING OFF) SELECT * FROM pgbench_accounts LIMIT 1 OFFSET 10000000;

QUERY PLAN
Limit  (cost=168370.00..168370.02 rows=1 width=97) (actual rows=0.00 loops=1)
  ->  Seq Scan on pgbench_accounts  (cost=0.00..168370.00 rows=10000000 width=97) (actual rows=10000000.00 loops=1)
Planning Time: 0.059 ms
Execution Time: 426.570 ms

1.32 x slowdown.


SET timing_clock_source = 'system';
EXPLAIN (ANALYZE, BUFFERS 0) SELECT * FROM pgbench_accounts LIMIT 1 OFFSET 10000000;

Limit  (cost=168370.00..168370.02 rows=1 width=97) (actual time=882.843..882.843 rows=0.00 loops=1)
  ->  Seq Scan on pgbench_accounts  (cost=0.00..168370.00 rows=10000000 width=97) (actual time=0.021..593.587 rows=10000000.00 loops=1)
Planning Time: 0.063 ms
Execution Time: 882.860 ms

2.06 x slowdown relative to TIMING OFF


SET timing_clock_source = 'tsc';
Limit  (cost=168370.00..168370.02 rows=1 width=97) (actual time=543.098..543.098 rows=0.00 loops=1)
  ->  Seq Scan on pgbench_accounts  (cost=0.00..168370.00 rows=10000000 width=97) (actual time=0.017..413.878 rows=10000000.00 loops=1)
Planning Time: 0.061 ms
Execution Time: 543.122 ms

1.27 x slowdown relative to TIMING OFF

1.63x speedup relative to system.


But I also see ~20% gains for some TPCH queries, for example.


> To control use of the TSC, the new "timing_clock_source" GUC is introduced,
> whose default ("auto") automatically uses the TSC when running on Linux/x86-64,
> in case the system clocksource is reported as "tsc". The use of the system
> APIs can be enforced by setting "system", or on x86-64 architectures the
> use of TSC can be enforced by explicitly setting "tsc".

It's more widely enabled by default now, right?


> In order to use the TSC the frequency is first determined by use of CPUID,
> and if not available, by running a short calibration loop at program start,
> falling back to the system time if TSC values are not stable.
>
> Note, that we split TSC usage into the RDTSC CPU instruction which does not
> wait for out-of-order execution (faster, less precise) and the RDTSCP instruction,
> which waits for outstanding instructions to retire. RDTSCP is deemed to have
> little benefit in the typical InstrStartNode() / InstrStopNode() use case of
> EXPLAIN, and can be up to twice as slow. To separate these use cases, the new
> macro INSTR_TIME_SET_CURRENT_FAST() is introduced, which uses RDTSC.
>
> The original macro INSTR_TIME_SET_CURRENT() uses RDTSCP and is supposed
> to be used when precision is more important than performance. When the
> system timing clock source is used both of these macros instead utilize
> the system APIs (clock_gettime / QueryPerformanceCounter) like before.

Maybe worth adding that there are other things that may be worth converting,
like track_io_timing/track_wal_io_timing.


> +const char *
> +show_timing_clock_source(void)
> +{
> +    switch (timing_clock_source)
> +    {
> +        case TIMING_CLOCK_SOURCE_AUTO:
> +#if PG_INSTR_TSC_CLOCK
> +            if (pg_current_timing_clock_source() == TIMING_CLOCK_SOURCE_TSC)
> +                return "auto (tsc)";
> +#endif
> +            return "auto (system)";
> +        case TIMING_CLOCK_SOURCE_SYSTEM:
> +            return "system";
> +#if PG_INSTR_TSC_CLOCK
> +        case TIMING_CLOCK_SOURCE_TSC:
> +            return "tsc";
> +#endif

For a moment I was wondering if we should have this display the frequency and
whether it's calibrated. But I think that's too cute by half.



> +static void
> +set_ticks_per_ns(void)
> +{
> +#if PG_INSTR_TSC_CLOCK
> +    if (timing_tsc_enabled)
> +        set_ticks_per_ns_for_tsc();
> +    else
> +        set_ticks_per_ns_system();
> +#else
> +    set_ticks_per_ns_system();
> +#endif
> +}

How about?

static void
set_ticks_per_ns(void)
{
#if PG_INSTR_TSC_CLOCK
    if (timing_tsc_enabled)
    {
        set_ticks_per_ns_for_tsc();
        return;
    }
#endif
    set_ticks_per_ns_system();
}


> @@ -83,27 +88,90 @@ typedef struct instr_time
>  /* Shift amount for fixed-point ticks-to-nanoseconds conversion. */
>  #define TICKS_TO_NS_SHIFT 14
>
> -#ifdef WIN32
> -#define PG_INSTR_TICKS_TO_NS 1
> -#else
> -#define PG_INSTR_TICKS_TO_NS 0
> -#endif
> -

I'd add it to the place it'll later be added.


Think this is quite close.



> Subject: [PATCH v20 4/5] pg_test_timing: Also test RDTSC/RDTSCP timing and
>  report time source and TSC frequency


> +        /* Now, emit fast timing measurements */
> +        loop_count = test_timing(test_duration, TIMING_CLOCK_SOURCE_TSC, true);
> +        output(loop_count);
> +        printf("\n");
> +
> +        printf(_("TSC frequency in use: %u kHz\n"), timing_tsc_frequency_khz);
> +
> +        calibrated_freq = pg_tsc_calibrate_frequency();
> +        if (calibrated_freq > 0)
> +            printf(_("TSC frequency from calibration: %u kHz\n"), calibrated_freq);
> +        else
> +            printf(_("TSC calibration did not converge\n"));

If this were to indicate if the current frequency were from a non-calibration
source it'd be perfect, but that's definitely not required.



> Subject: [PATCH v20 5/5] instrumentation: ARM support for fast time
>  measurements
>
> Similar to the RDTSC/RDTSCP instructions on x86-64, this introduces
> use of the cntvct_el0 instruction on ARM systems to access the generic
> timer that provides a synchronized ticks value across CPUs.
>
> Note this adds an exception for Apple Silicon CPUs, due to the observed
> fact that M3 and newer has different timer frequencies for the Efficiency
> and the Performance cores, and we can't be sure where we get scheduled.
>
> To simplify the implementation this does not support Windows on ARM,
> since its quite rare and hard to test.
>
> Relies on the existing timing_clock_source GUC to control whether
> TSC-like timer gets used, instead of system timer.


> +/*
> + * Check whether this is a heterogeneous Apple Silicon P+E core system
> + * where CNTVCT_EL0 may tick at different rates on different core types.
> + */
> +static bool
> +aarch64_has_heterogeneous_cores(void)
> +{
> +#if defined(__APPLE__)
> +    int            nperflevels = 0;
> +    size_t        len = sizeof(nperflevels);
> +
> +    if (sysctlbyname("hw.nperflevels", &nperflevels, &len, NULL, 0) == 0)
> +        return nperflevels > 1;
> +#endif
> +
> +    return false;
> +}
> +
> +/*
> + * Detect the generic timer frequency on AArch64.
> + */
> +static void
> +tsc_detect_frequency(void)
> +{
> +    if (aarch64_has_heterogeneous_cores())
> +    {
> +        timing_tsc_frequency_khz = 0;
> +        return;
> +    }
> +
> +    timing_tsc_frequency_khz = aarch64_cntvct_frequency_khz();
> +}

> +/*
> + * The ARM generic timer is architecturally guaranteed to be monotonic and
> + * synchronized across cores of the same type, so we always use it by default
> + * when available and cores are homogenous.
> + */
> +static bool
> +tsc_use_by_default(void)
> +{
> +    return true;
> +}

I'm somewhat sceptical of that being viable, given that we only have support
for detecting heterogeneous cores on macOS.  You e.g. can run linux on M*
hardware. And I wonder if other big.LITTLE heterogeneous architectures have
the same problem...


> +uint32
> +pg_tsc_calibrate_frequency(void)
> +{
> +    /* No calibration loop on AArch64; frequency comes from CNTFRQ_EL0 */
> +    return 0;
> +}

Think I'd advocate for support for that if/when we add ARM support, even if
it's just to be able to verify things are sane via pg_test_timing.


> @@ -144,7 +150,6 @@ extern bool pg_set_timing_clock_source(TimingClockSourceType source);
>  #define PG_INSTR_TICKS_TO_NS 0
>  #endif
>
> -
>  /* Whether to actually use TSC based on availability and GUC settings. */
>  extern PGDLLIMPORT bool timing_tsc_enabled;
>

Spurious line change.


> +#elif defined(__aarch64__) && !defined(WIN32)
> +
> +/*
> + * Read the ARM generic timer counter (CNTVCT_EL0).
> + *
> + * The "fast" variant reads the counter without a barrier, analogous to RDTSC
> + * on x86. The regular variant issues an ISB (Instruction Synchronization
> + * Barrier) first, which acts as a serializing instruction analogous to RDTSCP,
> + * ensuring all preceding instructions have completed before reading the
> + * counter.
> + */
> +static inline instr_time
> +pg_get_ticks_fast(void)
> +{
> +    if (likely(timing_tsc_enabled))
> +    {
> +        instr_time    now;
> +
> +        now.ticks = __builtin_arm_rsr64("cntvct_el0");
> +        return now;

Seems like this is about !msvc (or rather a gcc-like compiler), rather than
about windows?


Greetings,

Andres Freund


