Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc? - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Msg-id 33xt5wzr7g4kstfcugjm33rnm6uwtolay33ku5ctnfoyb235jm@7m42xkovofxr
In response to Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?  (Lukas Fittl <lukas@fittl.com>)
Responses Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
List pgsql-hackers
Hi,

Lukas, could you add some preliminary Reviewed-by tags to the
not-yet-committed commits?


On 2026-03-25 18:17:44 -0700, Lukas Fittl wrote:
> From 7cc7e230e05947892b2aa479eba45144beede48e Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <lukas@fittl.com>
> Date: Sat, 31 Jan 2026 08:49:46 -0800
> Subject: [PATCH v14 1/6] Check for HAVE__CPUIDEX and HAVE__GET_CPUID_COUNT
>  separately

John, do you want to take this one, or should I take it?


> Subject: [PATCH v14 2/6] pg_test_timing: Reduce per-loop overhead

Pushed, after simplifying the following

> +    INSTR_TIME_ADD_NANOSEC(end_time, duration > 0 ? duration * NS_PER_S : 0);

by removing that ?:, since duration is unsigned and the caller checks for 0
and negative values. And even if a negative value or 0 were to be passed, it'd
still be harmless.



> From 8fbde9831fd89615a41af56917aed9faf619dbbb Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <lukas@fittl.com>
> Date: Fri, 25 Jul 2025 17:57:20 -0700
> Subject: [PATCH v14 3/6] instrumentation: Streamline ticks to nanosecond
>  conversion across platforms
> 
> The timing infrastructure (INSTR_* macros) measures time elapsed using
> clock_gettime() on POSIX systems, which returns the time as nanoseconds,
> and QueryPerformanceCounter() on Windows, which is a specialized timing
> clock source that returns a tick counter that needs to be converted to
> nanoseconds using the result of QueryPerformanceFrequency().
> 
> This conversion currently happens ad-hoc on Windows, e.g. when calling
> INSTR_TIME_GET_NANOSEC, which calls QueryPerformanceFrequency() on every
> invocation, despite the frequency being stable after program start,
> incurring unnecessary overhead. It also causes a fractured implementation
> where macros are defined differently between platforms.
> 
> To ease code readability, and prepare for a future change that intends
> to use a ticks-to-nanosecond conversion on x86-64 for TSC use, introduce
> a new pg_ticks_to_ns() function that gets called on all platforms.
> 
> This function relies on a separately initialized ticks_per_ns_scaled
> value, that represents the conversion ratio. This value is initialized
> from QueryPerformanceFrequency() on Windows, and set to zero on x86-64
> POSIX systems, which results in the ticks being treated as nanoseconds.
> Other architectures always directly return the original ticks.

> To support this, pg_initialize_timing() is introduced, and is now
> mandatory for both the backend and any frontend programs to call before
> utilizing INSTR_* macros.

Are there assertions detecting missing initializations? Ah, yes, there are.


> +/*
> + * Stores what the number of ticks needs to be multiplied with to end up
> + * with nanoseconds using integer math.
> + *
> + * On certain platforms (currently Windows) the ticks to nanoseconds conversion
> + * requires floating point math because:
> + *
> + * sec = ticks / frequency_hz
> + * ns  = ticks / frequency_hz * 1,000,000,000
> + * ns  = ticks * (1,000,000,000 / frequency_hz)
> + * ns  = ticks * (1,000,000 / frequency_khz) <-- now in kilohertz
> + *
> + * Here, the scale factor is usually a fractional number. For example, for a
> + * 2.5 GHz counter the scaling factor becomes 1,000,000 / 2,500,000 = 0.4.
> + *
> + * To be able to use integer math we work around the lack of precision. We
> + * first scale the integer up (left shift by TICKS_TO_NS_SHIFT) and after the
> + * multiplication by the number of ticks in pg_ticks_to_ns() we shift right by
> + * the same amount. We utilize unsigned integers even though ticks are stored
> + * as a signed value to encourage compilers to generate better assembly.
> + *
> + * We remember the maximum number of ticks that can be multiplied by the scale
> + * factor without overflowing so we can check via a * b > max <=> a > max / b.
> + *
> + * On all other platforms we are using clock_gettime(), which uses nanoseconds
> + * as ticks. Hence, we set the multiplier to zero, which causes pg_ticks_to_ns
> + * to return the original value.
> + */
> +uint64        ticks_per_ns_scaled = 0;
> +uint64        max_ticks_no_overflow = 0;
> +bool        timing_initialized = false;

Think it may be worth adding an example with a realistic max_ticks_no_overflow
to show that we will almost always avoid the overflow case.


> +static void set_ticks_per_ns(void);
> +
> +void
> +pg_initialize_timing(void)
> +{
> +    if (timing_initialized)
> +        return;
> +
> +    set_ticks_per_ns();
> +    timing_initialized = true;
> +}
> +
> +#ifndef WIN32
> +
> +static void
> +set_ticks_per_ns(void)
> +{
> +    ticks_per_ns_scaled = 0;
> +    max_ticks_no_overflow = 0;
> +}
> +
> +#else                            /* WIN32 */
> +
> +/* GetTimerFrequency returns counts per second */
> +static inline double
> +GetTimerFrequency(void)
> +{
> +    LARGE_INTEGER f;
> +
> +    QueryPerformanceFrequency(&f);
> +    return (double) f.QuadPart;
> +}
> +
> +static void
> +set_ticks_per_ns(void)
> +{
> +    ticks_per_ns_scaled = (NS_PER_S << TICKS_TO_NS_SHIFT) / GetTimerFrequency();
> +    max_ticks_no_overflow = PG_INT64_MAX / ticks_per_ns_scaled;
> +}
> +
> +#endif                            /* WIN32 */

Seems a tiny bit odd to introduce these here under one name, only to then
evolve the naming pattern in the subsequent patches.

> +static inline int64
> +pg_ticks_to_ns(int64 ticks)
>  {
> -    LARGE_INTEGER f;
> +#if PG_INSTR_TICKS_TO_NS
> +    int64        ns = 0;
> +
> +    Assert(timing_initialized);
> +
> +    /*
> +     * Avoid doing work if we don't use scaled ticks, e.g. system clock on
> +     * Unix
> +     */
> +    if (ticks_per_ns_scaled == 0)
> +        return ticks;
> +
> +    /*
> +     * Would multiplication overflow? If so perform computation in two parts.
> +     */
> +    if (unlikely(ticks > (int64) max_ticks_no_overflow))
> +    {
> +        /*
> +         * To avoid overflow, first scale total ticks down by the fixed
> +         * factor, and *afterwards* multiply them by the frequency-based scale
> +         * factor.
> +         *
> +         * The remaining ticks can follow the regular formula, since they
> +         * won't overflow.
> +         */
> +        int64        count = ticks >> TICKS_TO_NS_SHIFT;
> +
> +        ns = count * ticks_per_ns_scaled;
> +        ticks -= (count << TICKS_TO_NS_SHIFT);
> +    }
> +
> +    ns += (ticks * ticks_per_ns_scaled) >> TICKS_TO_NS_SHIFT;
> +
> +    return ns;
> +#else
> +    Assert(timing_initialized);
>  
> -    QueryPerformanceFrequency(&f);
> -    return (double) f.QuadPart;
> +    return ticks;
> +#endif                            /* PG_INSTR_TICKS_TO_NS */

Kinda wondering what the best way would be to verify that the overflow case
actually works.  What about adding a regress.c test that checks that we get
the right results for something like

   INSTR_TIME_SET_ZERO(t);
   INSTR_TIME_ADD_NANOSEC(t, some_number);
   EXPECT_EQ_U64(INSTR_TIME_GET_NANOSEC(t), some_number);

for a few different some_numbers?



> Subject: [PATCH v14 4/6] instrumentation: Use Time-Stamp Counter (TSC) on
>  x86-64 for faster measurements

> +bool
> +check_timing_clock_source(int *newval, void **extra, GucSource source)
> +{
> +    /*
> +     * Ensure timing is initialized. On Windows (EXEC_BACKEND), GUC hooks can
> +     * be called during InitializeGUCOptions() before InitProcessGlobals() has
> +     * had a chance to run pg_initialize_timing().
> +     */
> +    pg_initialize_timing();

So doesn't that mean the effort to synchronize the tsc freq on EXEC_BACKEND is futile?


> diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c
> index 1d9ee4fb588..ee0e3a3b0ab 100644
> --- a/src/bin/pg_test_timing/pg_test_timing.c
> +++ b/src/bin/pg_test_timing/pg_test_timing.c
> @@ -43,7 +43,9 @@ main(int argc, char *argv[])
>  
>      handle_args(argc, argv);
>  
> -    /* initialize timing infrastructure (required for INSTR_* calls) */
> +    /*
> +     * Initialize timing infrastructure (required for INSTR_* calls)
> +     */
>      pg_initialize_timing();
>  
>      loop_count = test_timing(test_duration);

This seems like a superfluous change?


> +bool
> +pg_set_timing_clock_source(TimingClockSourceType source)
> +{
> +    Assert(timing_initialized);
> +
> +#if PG_INSTR_TSC_CLOCK
> +    pg_initialize_timing_tsc();
> +#endif
> +
> +#if PG_INSTR_TSC_CLOCK
> +    switch (source)
> +    {
> +        case TIMING_CLOCK_SOURCE_AUTO:
> +            use_tsc = (tsc_frequency_khz > 0) && tsc_use_by_default();

A global named use_tsc seems a tad too short a name for my personal taste. Too
likely to also be used by something else.


> +int32        tsc_frequency_khz = -1;
> +
> +static uint32 tsc_calibrate(void);
> +
> +/*
> + * Detect the TSC frequency and whether RDTSCP is available on x86-64.
> + *
> + * This can't be reliably determined at compile time, since the
> + * availability of an "invariant" TSC (that is not affected by CPU
> + * frequency changes) is dependent on the CPU architecture. Additionally,
> + * there are cases where TSC availability is impacted by virtualization,
> + * where a simple cpuid feature check would not be enough.
> + */
> +static void
> +tsc_detect_frequency(void)
> +{
> +    tsc_frequency_khz = 0;

So we're using -1 for uninitialized and 0 for unknown? If so we should
document that.


> +    /* We require RDTSCP support, bail if not available */
> +    if (!x86_feature_available(PG_RDTSCP))
> +        return;
> +
> +    /* Determine speed at which the TSC advances */
> +    tsc_frequency_khz = x86_tsc_frequency_khz();
> +    if (tsc_frequency_khz > 0)
> +        return;
> +
> +    /*
> +     * CPUID did not give us the TSC frequency. If TSC is invariant and RDTSCP
> +     * is available, we can measure the frequency by comparing TSC ticks
> +     * against walltime using a short calibration loop.
> +     */
> +    if (x86_feature_available(PG_TSC_INVARIANT))
> +        tsc_frequency_khz = tsc_calibrate();
> +}

Why do we not need to always check if the TSC is invariant? Seems like a hard
requirement to me?



> +/*
> + * Calibrate the TSC frequency by comparing TSC ticks against walltime.
> + *
> + * Takes initial TSC and system clock snapshots, then loops, recomputing the
> + * frequency each TSC_CALIBRATION_SKIPS iterations from cumulative TSC
> + * ticks divided by elapsed time.
> + *
> + * Once the frequency estimate stabilizes (consecutive iterations agree), we
> + * consider it converged and the frequency in KHz is returned. If either too
> + * many iterations or a time limit passes without convergence, 0 is returned.
> + */
> +#define TSC_CALIBRATION_MAX_NS        (50 * NS_PER_MS)
> +#define TSC_CALIBRATION_ITERATIONS    1000000
> +#define TSC_CALIBRATION_SKIPS        100
> +#define TSC_CALIBRATION_STABLE_CYCLES    10
> +

> +static uint32
> +tsc_calibrate(void)
> +{
> +    instr_time    initial_wall;
> +    int64        initial_tsc;
> +    double        freq_khz = 0;
> +    double        prev_freq_khz = 0;
> +    int            stable_count = 0;
> +    int64        prev_tsc;
> +    uint32        unused;
> +
> +    /* Ensure INSTR_* time below work on system time */
> +    set_ticks_per_ns_system();
> +
> +    INSTR_TIME_SET_CURRENT(initial_wall);
> +
> +#ifdef _MSC_VER
> +    initial_tsc = __rdtscp(&unused);
> +#else
> +    initial_tsc = __builtin_ia32_rdtscp(&unused);
> +#endif
> +    prev_tsc = initial_tsc;
> +
> +    for (int i = 0; i < TSC_CALIBRATION_ITERATIONS; i++)
> +    {
> +        instr_time    now_wall;
> +        int64        now_tsc;
> +        int64        elapsed_ns;
> +        int64        elapsed_ticks;
> +
> +        INSTR_TIME_SET_CURRENT(now_wall);
> +
> +#ifdef _MSC_VER
> +        now_tsc = __rdtscp(&unused);
> +#else
> +        now_tsc = __builtin_ia32_rdtscp(&unused);
> +#endif

I'd put this in a helper, given that we already need it twice, and we might
later need to extend it for other architectures.


> +        /*
> +         * Skip if this is not the Nth cycle where we measure, if TSC hasn't
> +         * advanced, or we walked backwards for some reason.
> +         */
> +        if (i % TSC_CALIBRATION_SKIPS != 0 || now_tsc == prev_tsc || elapsed_ns <= 0 || elapsed_ticks <= 0)
> +            continue;

What's the goal of TSC_CALIBRATION_SKIPS?



> +        freq_khz = ((double) elapsed_ticks / elapsed_ns) * 1000 * 1000;
> +
> +        /*
> +         * Once freq_khz / prev_freq_khz is small, check if it stays that way.
> +         * If it does for long enough, we've got a winner frequency.
> +         */
> +        if (prev_freq_khz != 0 && fabs(1 - freq_khz / prev_freq_khz) < 0.0001)
> +        {
> +            stable_count++;
> +            if (stable_count >= TSC_CALIBRATION_STABLE_CYCLES)
> +                return (uint32) freq_khz;
> +        }
> +        else
> +            stable_count = 0;
> +
> +        prev_tsc = now_tsc;
> +        prev_freq_khz = freq_khz;
> +    }
> +
> +    /* did not converge */
> +    return 0;
> +}

I think it may be worth teaching pg_test_timing to display the tsc frequency
and show the frequency determined by calibration even if we got it via
cpuid. Otherwise it'll be hard to debug this.


> +/*
> + * Initialize the TSC clock source by determining its usability and frequency.
> + *
> + * This can be called multiple times, as tsc_frequency_khz will be set to 0
> + * if a prior call determined the TSC is not usable. On EXEC_BACKEND (Windows),
> + * the TSC frequency may also be set by restore_backend_variables.
> + */
> +void
> +pg_initialize_timing_tsc(void)
> +{
> +    if (tsc_frequency_khz < 0)
> +        tsc_detect_frequency();
> +}

Maybe we should add a debug log with the frequency in the < 0 case?

> +uint32
> +x86_tsc_frequency_khz(void)
> +{
> +    unsigned int r[4] = {0};
> +
> +    if (!x86_feature_available(PG_TSC_INVARIANT))
> +        return 0;
> +
> +    if (x86_feature_available(PG_HYPERVISOR))
> +        return x86_hypervisor_tsc_frequency_khz();
> +
> +    /*
> +     * On modern Intel CPUs, the TSC is implemented by invariant timekeeping
> +     * hardware, also called "Always Running Timer", or ART. The ART stays
> +     * consistent even if the CPU changes frequency due to changing power
> +     * levels.
> +     *
> +     * As documented in "Determining the Processor Base Frequency" in the
> +     * "Intel® 64 and IA-32 Architectures Software Developer's Manual",
> +     * February 2026 Edition, we can get the TSC frequency as follows:
> +     *
> +     * Nominal TSC frequency = ( CPUID.15H:ECX[31:0] * CPUID.15H:EBX[31:0] ) /
> +     * CPUID.15H:EAX[31:0]
> +     *
> +     * With CPUID.15H:ECX representing the nominal core crystal clock
> +     * frequency, and EAX/EBX representing values used to translate the TSC
> +     * value to that frequency, see "Chapter 20.17 "Time-Stamp Counter" of
> +     * that manual.
> +     *
> +     * Older Intel CPUs, and other vendors do not set CPUID.15H:ECX, and as
> +     * such we fall back to alternate approaches.
> +     */
> +    pg_cpuid(0x15, r);
> +    if (r[ECX] > 0)
> +    {
> +        /*
> +         * EBX not being set indicates invariant TSC is not available. Require
> +         * EAX being non-zero too, to avoid a theoretical divide by zero.
> +         */
> +        if (r[EAX] == 0 || r[EBX] == 0)
> +            return 0;
> +
> +        return r[ECX] / 1000 * r[EBX] / r[EAX];
> +    }
> +
> +    /*
> +     * When CPUID.15H is not available/incomplete, but we have verified an
> +     * invariant TSC is used, we can instead get the processor base frequency
> +     * in MHz from CPUID.16H:EAX, the "Processor Frequency Information Leaf".
> +     */

s/can instead/can possibly instead/

As it's also optional...



> From 44292c17075264c659e03d784a6c7619e4c5bd3e Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <lukas@fittl.com>
> Date: Tue, 10 Mar 2026 01:38:14 -0700
> Subject: [PATCH v14 6/6] instrumentation: ARM support for fast time
>  measurements
> 
> Similar to the RDTSC/RDTSCP instructions on x86-64, this introduces
> use of the cntvct_el0 instruction on ARM systems to access the generic
> timer that provides a synchronized ticks value across CPUs.
> 
> Note this adds an exception for Apple Silicon CPUs, due to the observed
> fact that M3 and newer have different timer frequencies for the Efficiency
> and the Performance cores, and we can't be sure where we get scheduled.

There are other such SOCs with different clock speeds for
performance/efficiency cores, I'm a bit worried that just denylisting Apple
Silicon will end up allowing other such systems.


Greetings,

Andres Freund


