Reduce timing overhead of EXPLAIN ANALYZE using rdtsc? - Mailing list pgsql-hackers
From: Andres Freund
Subject: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Msg-id: 20200612232810.f46nbqkdhbutzqdg@alap3.anarazel.de
List: pgsql-hackers
Hi,

Currently, using EXPLAIN (ANALYZE) without TIMING OFF regularly changes the resulting timing enough that the times aren't meaningful. E.g.

CREATE TABLE lotsarows(key int not null);
INSERT INTO lotsarows SELECT generate_series(1, 50000000);
VACUUM FREEZE lotsarows;

-- best of three:
SELECT count(*) FROM lotsarows;
Time: 1923.394 ms (00:01.923)

-- best of three:
EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM lotsarows;
Time: 2319.830 ms (00:02.320)

-- best of three:
EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 4202.649 ms (00:04.203)

That's nearly *double* the execution time compared to TIMING OFF.

Looking at a profile of this shows that we spend a good bit of cycles "normalizing" timestamps etc. That seems pretty unnecessary, just forced on us by struct timespec. So the first attached patch simply turns instr_time into a 64bit integer counting nanoseconds. That helps, a tiny bit:

EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 4179.302 ms (00:04.179)

but obviously doesn't move the needle. Looking at a profile, it's easy to confirm that we spend a lot of time acquiring the time:

- 95.49% 0.00% postgres postgres [.] agg_retrieve_direct (inlined)
  - agg_retrieve_direct (inlined)
    - 79.27% fetch_input_tuple
      - ExecProcNode (inlined)
        - 75.72% ExecProcNodeInstr
          + 25.22% SeqNext
          - 21.74% InstrStopNode
            + 17.80% __GI___clock_gettime (inlined)
          - 21.44% InstrStartNode
            + 19.23% __GI___clock_gettime (inlined)
          + 4.06% ExecScan
    + 13.09% advance_aggregates (inlined)
      1.06% MemoryContextReset

And that's even though Linux avoids a syscall (in most cases) to acquire the time. Unless the kernel detects a reason not to do so, Linux does this by using 'rdtscp' and multiplying the result by kernel-provided factors to turn the cycles into time. Some of the time is spent on function calls, dividing into struct timespec, etc., but most of it is just the rdtscp instruction itself:

65.30 │1 63: rdtscp

The reason for that is largely that rdtscp waits until all prior instructions have finished (though it allows later instructions to already start). And we incur that multiple times for each tuple.

In the second attached prototype patch I've changed instr_time to count in CPU cycles instead of nanoseconds, and then just turned the cycles into seconds in INSTR_TIME_GET_DOUBLE() (more about that part later). When using rdtsc that results in *vastly* lower overhead:

                                                           QUERY PLAN
 ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Aggregate  (cost=846239.20..846239.21 rows=1 width=8) (actual time=2610.235..2610.235 rows=1 loops=1)
   ->  Seq Scan on lotsarows  (cost=0.00..721239.16 rows=50000016 width=0) (actual time=0.006..1512.886 rows=50000000 loops=1)
 Planning Time: 0.028 ms
 Execution Time: 2610.256 ms
(4 rows)

Time: 2610.589 ms (00:02.611)

And there are still some smaller improvements that could be made on top of that.

As a comparison, here's the time when using rdtscp directly in instr_time, instead of going through clock_gettime:

Time: 3481.162 ms (00:03.481)

That shows pretty well how big the cost of the added pipeline stalls is, and how important out-of-order execution is for decent performance...
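To make the direction concrete, here is a minimal sketch (not the attached patch itself) of what a cycle-counting instr_time could look like. The macro names mirror the existing ones in src/include/portability/instr_time.h; cycles_to_sec is a hypothetical factor, assumed to be initialized once from the base frequency discussed further down:

#include <stdint.h>
#include <x86intrin.h>          /* __rdtsc() on gcc/clang */

typedef uint64_t instr_time;    /* raw cycle count, not struct timespec */

/* hypothetical: 1.0 / TSC frequency in Hz, set up once at startup */
static double cycles_to_sec;

#define INSTR_TIME_SET_ZERO(t)          ((t) = 0)
/* plain rdtsc, no serialization of prior instructions */
#define INSTR_TIME_SET_CURRENT(t)       ((t) = __rdtsc())
#define INSTR_TIME_ACCUM_DIFF(x, y, z)  ((x) += (y) - (z))

/* convert to seconds only when the value is displayed, off the hot path */
#define INSTR_TIME_GET_DOUBLE(t)        ((double) (t) * cycles_to_sec)

With that, InstrStartNode()/InstrStopNode() cost little more than the rdtsc itself plus some integer arithmetic, rather than a clock_gettime call and timespec normalization per tuple.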
In my opinion, for the use in InstrStartNode(), InstrStopNode() etc., we do *not* want to wait for prior instructions to finish, since that actually leads to the timing being less accurate, rather than more. There are other cases where that'd be different, e.g. measuring how long an entire query takes (though there it's probably irrelevant which one to use).

I've skipped a bit over the details of how to turn the cycles returned by rdtsc into time: on x86 CPUs of the last ~12 years, rdtsc doesn't return the number of cycles that have actually been run, but instead the number of 'reference cycles'. That's important because otherwise things like turbo mode and lower power modes would lead to completely bogus times. Thus, knowing the "base frequency" of the CPU allows us to turn the difference between two rdtsc return values into seconds.

In the attached prototype I just determined that frequency using cpuid(0x16). That's only available since Skylake (I think). On older CPUs we'd have to look at /proc/cpuinfo or /sys/devices/system/cpu/cpu0/cpufreq/base_frequency.
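For illustration, a sketch of that frequency lookup (assuming gcc/clang's <cpuid.h>; the function name is made up): CPUID leaf 0x16 reports the processor's base frequency in MHz in EAX.

#include <stdint.h>
#include <cpuid.h>              /* __get_cpuid() on gcc/clang */

/*
 * Return the TSC ("reference cycle") frequency in Hz, or 0 if CPUID
 * leaf 0x16 is unavailable (pre-Skylake), leaving the caller to fall
 * back to /proc/cpuinfo or sysfs.
 */
static uint64_t
get_tsc_freq_cpuid(void)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    /* __get_cpuid() returns 0 if the requested leaf isn't supported */
    if (!__get_cpuid(0x16, &eax, &ebx, &ecx, &edx) || eax == 0)
        return 0;

    /* EAX holds the processor base frequency in MHz */
    return (uint64_t) eax * 1000000;
}

The cycles_to_sec factor from the earlier sketch would then be 1.0 / get_tsc_freq_cpuid(), with the /proc/cpuinfo or sysfs fallback used whenever this returns 0.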
There are also other issues with using rdtsc directly: on older CPUs, in particular older multi-socket systems, the tsc will not be synchronized in detail across cores. There are bits that'd let us check whether the tsc is suitable or not (a sketch of such a check follows below). The more current issue is that things like virtual machines being migrated can lead to rdtsc suddenly returning a different value / the frequency differing. But that is supposed to be solved these days, by having virtualization technologies set frequency multipliers and offsets which then cause rdtsc[p] to return something meaningful, even after migration.
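One such check, sketched here for illustration (the function name is made up): the "invariant TSC" bit in CPUID leaf 0x80000007, EDX bit 8, which Linux also surfaces as the constant_tsc/nonstop_tsc flags in /proc/cpuinfo.

#include <stdbool.h>
#include <cpuid.h>

/*
 * Check for an invariant TSC, i.e. one that ticks at a constant rate
 * regardless of frequency scaling and power states
 * (CPUID leaf 0x80000007, EDX bit 8).
 */
static bool
tsc_is_invariant(void)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx))
        return false;
    return (edx & (1 << 8)) != 0;
}

Note that even an invariant TSC doesn't by itself guarantee cross-socket synchronization on old multi-socket systems, so a conservative implementation would fall back to clock_gettime when the check fails.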
The attached patches are really just a prototype. I'm also not really planning to work on getting this into a "production ready" patchset anytime soon. I developed it primarily because the overhead made it too hard to nail down in which part of a query tree performance changed. If somebody else wants to continue from here...

I do think it'd be a pretty significant improvement if we could reduce the timing overhead of EXPLAIN ANALYZE by this much, even if it requires a bunch of low-level code.

Greetings,

Andres Freund