Thread: Sampling-based timing for EXPLAIN ANALYZE
Hi,
Since EXPLAIN ANALYZE with TIMING ON still carries noticeable overhead on modern hardware (despite time sources having gotten faster), I'd like to propose a new option for EXPLAIN ANALYZE, called "TIMING SAMPLING", as an alternative to TIMING ON.
This new timing mode uses a timer firing at a fixed frequency (e.g. 100 or 1000 Hz) to gather a sampled timestamp on a predefined schedule, instead of reading the clock on demand whenever InstrStartNode/InstrStopNode is called. To implement the timer, we can use the existing timeout infrastructure, which is backed by a wall clock timer (ITIMER_REAL).
Conceptually this is inspired by how sampling profilers work (e.g. "perf"), but it ties into the existing per-plan node instrumentation done by EXPLAIN ANALYZE, and simply provides a lower accuracy version of the total time for each plan node.
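As a self-contained illustration of the mechanism (a toy sketch, not the patch itself - the names, the fixed 1000 Hz setup and the torn-read handling are simplified assumptions), a wall-clock interval timer updates a cached timestamp at the sampling frequency, and the per-node start/stop accounting reads that cached value instead of calling the clock itself:

#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/time.h>
#include <time.h>

/*
 * Updated only from the timer handler. A real implementation has to care
 * about atomicity/torn reads here; glossed over in this toy.
 */
static volatile int64_t sampled_now_ns;
static volatile double work_sink;

static int64_t
clock_now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t) ts.tv_sec * 1000000000 + ts.tv_nsec;
}

/* Fires at the sampling frequency; the only place the clock is actually read. */
static void
sampling_handler(int signo)
{
    (void) signo;
    sampled_now_ns = clock_now_ns();
}

/* Stand-in for executing a plan node for a while. */
static void
do_some_work(void)
{
    for (int i = 0; i < 5000000; i++)
        work_sink += i * 0.5;
}

int
main(void)
{
    struct sigaction sa = {0};
    struct itimerval it = {0};
    int64_t     sampled_total = 0;
    int64_t     wall_start = clock_now_ns();

    sa.sa_handler = sampling_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGALRM, &sa, NULL);

    /* 1000 Hz, i.e. one sampled timestamp per millisecond (ITIMER_REAL). */
    it.it_interval.tv_usec = 1000;
    it.it_value.tv_usec = 1000;
    setitimer(ITIMER_REAL, &it, NULL);

    sampled_now_ns = wall_start;    /* seed the cache before the first tick */

    for (int loop = 0; loop < 50; loop++)
    {
        int64_t     start = sampled_now_ns;         /* "InstrStartNode" */

        do_some_work();

        sampled_total += sampled_now_ns - start;    /* "InstrStopNode" */
    }

    printf("sampled total: %.3f ms, wall clock: %.3f ms\n",
           sampled_total / 1e6, (clock_now_ns() - wall_start) / 1e6);
    return 0;
}

At 1000 Hz that is one clock read per millisecond in total, regardless of how many times nodes start and stop in between, at the cost of roughly millisecond-granularity attribution per node.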
In EXPLAIN output this is marked as "sampled time", and scaled to the total wall clock time (to adjust for the sampling undercounting):
=# EXPLAIN (ANALYZE, BUFFERS, TIMING SAMPLING, SAMPLEFREQ 100) SELECT ...;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=201747.90..201748.00 rows=10 width=12) (actual sampled time=5490.974 rows=9 loops=1)
   ...
   ->  Hash Join  (cost=0.23..199247.90 rows=499999 width=4) (actual sampled time=3738.619 rows=9000000 loops=1)
         ...
         ->  Seq Scan on large  (cost=0.00..144247.79 rows=9999979 width=4) (actual sampled time=1004.671 rows=10000001 loops=1)
               ...
         ->  Hash  (cost=0.10..0.10 rows=10 width=4) (actual sampled time=0.000 rows=10 loops=1)
               ...
Execution Time: 5491.475 ms
---
In simple query tests like this on my local machine, this shows a consistent benefit over TIMING ON (and behaves close to ANALYZE with TIMING OFF), whilst providing a "good enough" accuracy to identify which part of the query was problematic.
Attached is a prototype patch for early feedback on the concept, with tests and documentation to come in a follow-up. Since the January commitfest is still marked as open I'll register it there, but note that my assumption is this is *not* Postgres 16 material.
As an open item, note that in the patch the requested sampling frequency is not yet passed to parallel workers (it always defaults to 1000 Hz when sampling is enabled). Also, note the timing frequency is limited to a maximum of 1000 Hz (1ms) due to current limitations of the timeout infrastructure.
With thanks to Andres Freund for help on refining the idea, collaborating on early code and finding the approach to hook into the timeout API.
Thanks,
Lukas
--
Lukas Fittl
Nice addition! And the code looks pretty straightforward.
The current patch triggers warnings: https://cirrus-ci.com/task/6016013976731648 Looks like you need to add void as the argument.
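For reference, the kind of change that warning points at (the handler name below is made up): in C an empty parameter list is an old-style declaration, and spelling out void makes it a proper prototype.

/* Before: not a prototype, so compilers warn ("function declaration isn't a prototype") */
void SamplingTimeoutHandler();

/* After: an explicit (void) parameter list silences the warning */
void SamplingTimeoutHandler(void);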
Do you have some performance comparison between TIMING ON and TIMING SAMPLING?
In InstrStartSampling there's logic to increase/decrease the frequency of an already existing timer. It's not clear to me when this can occur. I'd expect sampling frequency to remain constant throughout an explain plan. If it's indeed needed, I think a code comment would be useful to explain why this edge case is necessary.
Nice idea.

On 1/6/23 10:19, Jelte Fennema wrote:
> Do you have some performance comparison between TIMING ON and TIMING
> SAMPLING?

+1 to see some numbers compared to TIMING ON. Mostly I'm wondering if the
sampling-based approach gains us enough to be worth it, once the patch to use
RDTSC has hopefully landed (see [1]).

I believe that with the RDTSC patch the overhead of TIMING ON is lower than
the overhead of using ANALYZE with TIMING OFF in the first place. Hence, to be
really useful, it would be great if on top of TIMING SAMPLING we could also
lower the overhead of ANALYZE itself further (e.g. by using a fast path for
the default EXPLAIN (ANALYZE, TIMING ON / SAMPLING)). Currently,
InstrStartNode() and InstrStopNode() have a ton of branches, and without all
the typically deactivated code the implementation would be very small and
could be placed in an inlinable function.

[1] https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de

--
David Geier
(ServiceNow)
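To make the fast-path idea concrete, here is a minimal sketch (the struct and function names are hypothetical stand-ins; the real Instrumentation struct has many more fields, and the real functions defer loop accounting to InstrEndLoop):

#include "postgres.h"
#include "portability/instr_time.h"

/* Hypothetical trimmed-down per-node state for the fast path. */
typedef struct InstrumentationFast
{
    instr_time  starttime;      /* start of the current iteration */
    instr_time  counter;        /* accumulated runtime */
    double      ntuples;        /* tuples produced so far */
} InstrumentationFast;

/* No need_timer / bufusage / walusage branches: just read the clock. */
static inline void
InstrStartNodeFast(InstrumentationFast *instr)
{
    INSTR_TIME_SET_CURRENT(instr->starttime);
}

static inline void
InstrStopNodeFast(InstrumentationFast *instr, double nTuples)
{
    instr_time  endtime;

    INSTR_TIME_SET_CURRENT(endtime);
    INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->starttime);
    instr->ntuples += nTuples;
}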
Hi,

On 2023-01-13 09:11:06 +0100, David Geier wrote:
> Mostly I'm wondering if the sampling based approach gains us enough to be
> worth it, once the patch to use RDTSC hopefully landed (see [1]).

Well, I'm not sure we have a path forward on it. There are portability and
accuracy concerns. But more importantly:

> I believe that with the RDTSC patch the overhead of TIMING ON is lower than
> the overhead of using ANALYZE with TIMING OFF in the first place. Hence, to
> be really useful, it would be great if we could on top of TIMING SAMPLING
> also lower the overhead of ANALYZE itself further (e.g. by using a fast path
> for the default EXPLAIN (ANALYZE, TIMING ON / SAMPLING)). Currently,
> InstrStartNode() and InstrStopNode() have a ton of branches and without all
> the typically deactivated code the implementation would be very small and
> could be placed in an inlinable function.

Yes, I think SAMPLING could get rid of most of the instrumentation overhead -
at the cost of a bit less detail in the explain, of course. Which could make
it a lot more feasible to enable something like auto_explain.log_timing in
busy workloads.

For the sampling mode we don't really need something like InstrStart/StopNode.
We just need a pointer to the node currently executing - not free to set, but
still a heck of a lot cheaper than InstrStopNode(), even without ->need_timer
etc. Then the timer just needs to do

  instr->sampled_total += (now - last_sample)
  last_sample = now

I've been thinking that we should consider making more of the instrumentation
code work like that. The amount of work we're doing in InstrStart/StopNode()
has steadily risen. When buffer usage and WAL usage are enabled, we're
executing over 1000 instructions! And each single Instrumentation node is ~450
bytes, to a good degree due to having 2 BufUsage and 2 WalUsage structs
embedded.

If we instead have InstrStartNode() set up a global pointer to the
Instrumentation node, we can make the instrumentation code modify both the
"global" counters (pgBufferUsage, pgWalUsage) and, if set,
current_instr->{pgBufferUsage, pgWalUsage}. That'll require some larger
changes - right now nodes "automatically" include the IO/WAL incurred in child
nodes, but that's just a small bit of additional summing-up to be done during
EXPLAIN.

Separately, I think we should consider re-ordering Instrumentation so that
bufusage_start, walusage_start are after the much more commonly used elements.
We're forcing ntuples, nloops, ... onto separate cachelines, even though
they're accounted for unconditionally.

Greetings,

Andres Freund
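As a rough illustration of that dual-update idea (current_instr and CountBufferHit are hypothetical names; an actual change would touch many call sites, and would still need the per-child summing during EXPLAIN mentioned above):

#include "postgres.h"
#include "executor/instrument.h"

/* Hypothetical global, set by InstrStartNode() and cleared by InstrStopNode(). */
static Instrumentation *current_instr = NULL;

/*
 * Hypothetical helper for wherever a shared-buffer hit is currently counted:
 * bump the existing global counter and, if a node is being instrumented,
 * its per-node counter as well - no start/stop snapshot diffing needed.
 */
static inline void
CountBufferHit(void)
{
    pgBufferUsage.shared_blks_hit++;
    if (current_instr)
        current_instr->bufusage.shared_blks_hit++;
}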
On Fri, Jan 6, 2023 at 1:19 AM Jelte Fennema <me@jeltef.nl> wrote:
> Nice addition! And the code looks pretty straightforward.
Thanks for reviewing!
> The current patch triggers warnings: https://cirrus-ci.com/task/6016013976731648 Looks like you need to add void as the argument.
Fixed in v2 attached. This also adds a simple regression test, as well as fixes the parallel worker handling.
> Do you have some performance comparison between TIMING ON and TIMING SAMPLING?
Here are some benchmarks of auto_explain overhead on my ARM-based M1 MacBook for the following query, run with pgbench on a scale factor 100 data set:
SELECT COUNT(*) FROM pgbench_branches JOIN pgbench_accounts USING (bid) JOIN pgbench_tellers USING (bid) WHERE bid = 42;
(the motivation is to use a query that is more complex than the standard pgbench select-only test query)
avg latency (best of 3), -T 300, -c 4, -s 100, shared_buffers 2GB, fsync off, max_parallel_workers_per_gather 0:
master, log_timing = off: 871 ms (878 / 877 / 871)
patch, log_timing = off: 869 ms (882 / 880 / 869)
patch, log_timing = on: 890 ms (917 / 930 / 890)
patch, log_timing = sampling, samplefreq = 1000: 869 ms (887 / 869 / 894)
Additionally, here is Andres' benchmark from [1], with the sampling option added:
% psql -Xc 'DROP TABLE IF EXISTS t; CREATE TABLE t AS SELECT * FROM generate_series(1, 100000) g(i);' postgres && pgbench -n -r -t 100 -f <(echo -e "SELECT COUNT(*) FROM t;EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t;EXPLAIN (ANALYZE, TIMING SAMPLING) SELECT COUNT(*) FROM t;EXPLAIN (ANALYZE, TIMING ON) SELECT COUNT(*) FROM t;") postgres |grep '^ '
DROP TABLE
SELECT 100000
3.507 0 SELECT COUNT(*) FROM t;
3.476 0 EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t;
3.576 0 EXPLAIN (ANALYZE, TIMING SAMPLING) SELECT COUNT(*) FROM t;
5.096 0 EXPLAIN (ANALYZE, TIMING ON) SELECT COUNT(*) FROM t;
My pg_test_timing data for reference:
% pg_test_timing
Testing timing overhead for 3 seconds.
Per loop time including overhead: 23.65 ns
Histogram of timing durations:
  < us   % of total      count
     1     97.64472  123876325
     2      2.35421    2986658
     4      0.00022        277
     8      0.00016        202
    16      0.00064        815
    32      0.00005         64
> In InstrStartSampling there's logic to increase/decrease the frequency of an already existing timer. It's not clear to me when this can occur. I'd expect sampling frequency to remain constant throughout an explain plan. If it's indeed needed, I think a code comment would be useful to explain why this edge case is necessary.
Clarified in a code comment in v2. This is needed for handling nested statements, which could have different sampling frequencies for each nesting level - e.g. a function might want to sample its queries at a higher frequency than its caller.
Thanks,
--
Lukas Fittl
On 1/15/23 21:22, Andres Freund wrote:
> Hi,
>
> On 2023-01-13 09:11:06 +0100, David Geier wrote:
>> Mostly I'm wondering if the sampling based approach gains us enough to be
>> worth it, once the patch to use RDTSC hopefully landed (see [1]).
>
> Well, I'm not sure we have a path forward on it. There's portability and
> accuracy concerns. But more importantly:
>
>> I believe that with the RDTSC patch the overhead of TIMING ON is lower than
>> the overhead of using ANALYZE with TIMING OFF in the first place. Hence, to
>> be really useful, it would be great if we could on top of TIMING SAMPLING
>> also lower the overhead of ANALYZE itself further (e.g. by using a fast path
>> for the default EXPLAIN (ANALYZE, TIMING ON / SAMPLING)). Currently,
>> InstrStartNode() and InstrStopNode() have a ton of branches and without all
>> the typically deactivated code the implementation would be very small and
>> could be placed in an inlinable function.
>
> Yes, I think SAMPLING could get rid of most of the instrumentation overhead -
> at the cost of a bit less detail in the explain, of course. Which could make
> it a lot more feasible to enable something like auto_explain.log_timing in
> busy workloads.
>
> For the sampling mode we don't really need something like
> InstrStart/StopNode. We just need a pointer to the node currently executing -
> not free to set, but still a heck of a lot cheaper than InstrStopNode(), even
> without ->need_timer etc. Then the timer just needs to do
>   instr->sampled_total += (now - last_sample)
>   last_sample = now

I don't understand why we would even use timestamps, in this case? AFAIK
"sampling profilers" simply increment a counter for the executing node, and
then approximate the time as proportional to the count.

That also does not have issues with timestamp "rounding" - considering e.g.
sample rate 1000Hz, that's 1ms between samples. And it's quite possible the
node completes within 1ms, in which case

  (now - last_sample)

ends up being 0 (assuming I correctly understand the code). And I don't think
there's any particularly good way to correct this.

It seems ExplainNode() attempts to do some correction, but I doubt that's very
reliable, as these fast nodes will have sampled_total=0, so no matter what you
multiply this with, it'll still be 0.

> I've been thinking that we should consider making more of the instrumentation
> code work like that. The amount of work we're doing in InstrStart/StopNode()
> has steadily risen. When buffer usage and WAL usage are enabled, we're
> executing over 1000 instructions! And each single Instrumentation node is ~450
> bytes, to a good degree due to having 2 BufUsage and 2 WalUsage structs
> embedded.
>
> If we instead have InstrStartNode() set up a global pointer to the
> Instrumentation node, we can make the instrumentation code modify both the
> "global" counters (pgBufferUsage, pgWalUsage) and, if set,
> current_instr->{pgBufferUsage, pgWalUsage}. That'll require some larger
> changes - right now nodes "automatically" include the IO/WAL incurred in child
> nodes, but that's just a small bit of additional summing-up to be done during
> EXPLAIN.

That's certainly one way to implement that. I wonder if we could make that
work without the global pointer, but I can't think of any.

> Separately, I think we should consider re-ordering Instrumentation so that
> bufusage_start, walusage_start are after the much more commonly used
> elements. We're forcing ntuples, nloops, ... onto separate cachelines, even
> though they're accounted for unconditionally.

+1 to that

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2023-01-17 15:52:07 +0100, Tomas Vondra wrote:
> I don't understand why we would even use timestamps, in this case? AFAIK
> "sampling profilers" simply increment a counter for the executing node,
> and then approximate the time as proportional to the count.

The timer interrupt distances aren't all that evenly spaced, particularly
under load, and are easily distorted by having to wait for IO, an lwlock ...

> That also does not have issues with timestamp "rounding" - considering
> e.g. sample rate 1000Hz, that's 1ms between samples. And it's quite
> possible the node completes within 1ms, in which case
>
>   (now - last_sample)
>
> ends up being 0 (assuming I correctly understand the code).

That part should be counting in nanoseconds, I think? Unless I misunderstand
something?

We already compute the timestamp inside timeout.c, but don't yet pass that to
timeout handlers. I think there's others re-computing timestamps.

> And I don't think there's any particularly good way to correct this.
>
> It seems ExplainNode() attempts to do some correction, but I doubt
> that's very reliable, as these fast nodes will have sampled_total=0, so
> no matter what you multiply this with, it'll still be 0.

That's just the scaling to the "actual time" that you're talking about above,
no?

>> I've been thinking that we should consider making more of the instrumentation
>> code work like that. The amount of work we're doing in InstrStart/StopNode()
>> has steadily risen. When buffer usage and WAL usage are enabled, we're
>> executing over 1000 instructions! And each single Instrumentation node is ~450
>> bytes, to a good degree due to having 2 BufUsage and 2 WalUsage structs
>> embedded.
>>
>> If we instead have InstrStartNode() set up a global pointer to the
>> Instrumentation node, we can make the instrumentation code modify both the
>> "global" counters (pgBufferUsage, pgWalUsage) and, if set,
>> current_instr->{pgBufferUsage, pgWalUsage}. That'll require some larger
>> changes - right now nodes "automatically" include the IO/WAL incurred in child
>> nodes, but that's just a small bit of additional summing-up to be done during
>> EXPLAIN.
>
> That's certainly one way to implement that. I wonder if we could make
> that work without the global pointer, but I can't think of any.

I don't see a realistic way at least. We could pass down an
"InstrumentationContext" through everything that needs to do IO and WAL. But
that seems infeasible at this point.

Greetings,

Andres Freund
On 1/17/23 18:02, Andres Freund wrote:
> Hi,
>
> On 2023-01-17 15:52:07 +0100, Tomas Vondra wrote:
>> I don't understand why we would even use timestamps, in this case? AFAIK
>> "sampling profilers" simply increment a counter for the executing node,
>> and then approximate the time as proportional to the count.
>
> The timer interrupt distances aren't all that evenly spaced, particularly
> under load, and are easily distorted by having to wait for IO, an lwlock ...

OK, so the difference is that these events (I/O, lwlocks) may block signals,
and after signals get unblocked we only get a single event for each signal.
Yeah, the timestamp handles that case better.

>> That also does not have issues with timestamp "rounding" - considering
>> e.g. sample rate 1000Hz, that's 1ms between samples. And it's quite
>> possible the node completes within 1ms, in which case
>>
>>   (now - last_sample)
>>
>> ends up being 0 (assuming I correctly understand the code).
>
> That part should be counting in nanoseconds, I think? Unless I misunderstand
> something?

The higher precision does not help, because both values come from the
*sampled* timestamp (i.e. the one updated from the signal handler). So if the
node happens to execute between two signals, the values are going to be the
same, and the difference is 0.

Perhaps for many executions it works out, because some executions will cross
the boundary, and the average will converge to the right value.

> We already compute the timestamp inside timeout.c, but don't yet pass that to
> timeout handlers. I think there's others re-computing timestamps.
>
>> And I don't think there's any particularly good way to correct this.
>>
>> It seems ExplainNode() attempts to do some correction, but I doubt
>> that's very reliable, as these fast nodes will have sampled_total=0, so
>> no matter what you multiply this with, it'll still be 0.
>
> That's just the scaling to the "actual time" that you're talking about above,
> no?

Maybe, not sure.

>>> I've been thinking that we should consider making more of the instrumentation
>>> code work like that. The amount of work we're doing in InstrStart/StopNode()
>>> has steadily risen. When buffer usage and WAL usage are enabled, we're
>>> executing over 1000 instructions! And each single Instrumentation node is ~450
>>> bytes, to a good degree due to having 2 BufUsage and 2 WalUsage structs
>>> embedded.
>>>
>>> If we instead have InstrStartNode() set up a global pointer to the
>>> Instrumentation node, we can make the instrumentation code modify both the
>>> "global" counters (pgBufferUsage, pgWalUsage) and, if set,
>>> current_instr->{pgBufferUsage, pgWalUsage}. That'll require some larger
>>> changes - right now nodes "automatically" include the IO/WAL incurred in child
>>> nodes, but that's just a small bit of additional summing-up to be done during
>>> EXPLAIN.
>>
>> That's certainly one way to implement that. I wonder if we could make
>> that work without the global pointer, but I can't think of any.
>
> I don't see a realistic way at least. We could pass down an
> "InstrumentationContext" through everything that needs to do IO and WAL. But
> that seems infeasible at this point.

Why infeasible?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2023-01-17 19:00:02 +0100, Tomas Vondra wrote:
> On 1/17/23 18:02, Andres Freund wrote:
>> On 2023-01-17 15:52:07 +0100, Tomas Vondra wrote:
>>> That also does not have issues with timestamp "rounding" - considering
>>> e.g. sample rate 1000Hz, that's 1ms between samples. And it's quite
>>> possible the node completes within 1ms, in which case
>>>
>>>   (now - last_sample)
>>>
>>> ends up being 0 (assuming I correctly understand the code).
>>
>> That part should be counting in nanoseconds, I think? Unless I misunderstand
>> something?
>
> The higher precision does not help, because both values come from the
> *sampled* timestamp (i.e. the one updated from the signal handler). So
> if the node happens to execute between two signals, the values are going
> to be the same, and the difference is 0.

In that case there simply wasn't any sample for the node, and a non-timestamp
based sample counter wouldn't do anything different?

If you're worried about the case where a timer does fire during execution of
the node, but exactly once, that should provide a difference between the last
sampled timestamp and the current time. It'll attribute a bit too much to the
in-progress nodes, but well, that's sampling for you.

I think a "hybrid" explain mode might be worth thinking about. Use the
"current" sampling method for the first execution of a node, and for the first
few milliseconds of a query (or perhaps the first few timestamp acquisitions).
That provides an accurate explain analyze for short queries, without a
significant slowdown. Then switch to sampling, which provides decent
attribution for a bit longer running queries.

>>>> I've been thinking that we should consider making more of the instrumentation
>>>> code work like that. The amount of work we're doing in InstrStart/StopNode()
>>>> has steadily risen. When buffer usage and WAL usage are enabled, we're
>>>> executing over 1000 instructions! And each single Instrumentation node is ~450
>>>> bytes, to a good degree due to having 2 BufUsage and 2 WalUsage structs
>>>> embedded.
>>>>
>>>> If we instead have InstrStartNode() set up a global pointer to the
>>>> Instrumentation node, we can make the instrumentation code modify both the
>>>> "global" counters (pgBufferUsage, pgWalUsage) and, if set,
>>>> current_instr->{pgBufferUsage, pgWalUsage}. That'll require some larger
>>>> changes - right now nodes "automatically" include the IO/WAL incurred in child
>>>> nodes, but that's just a small bit of additional summing-up to be done during
>>>> EXPLAIN.
>>>
>>> That's certainly one way to implement that. I wonder if we could make
>>> that work without the global pointer, but I can't think of any.
>>
>> I don't see a realistic way at least. We could pass down an
>> "InstrumentationContext" through everything that needs to do IO and WAL. But
>> that seems infeasible at this point.
>
> Why infeasible?

Primarily the scale of the change. We'd have to pass down the context into all
table/index AM functions. And into a lot of bufmgr.c, xlog.c functions, which'd
require their callers to have access to the context. That's hundreds if not
thousands of places.

Adding that many function parameters might turn out to be noticeable runtime
wise, due to increased register movement. I think for a number of the places
where we currently don't, we ought to use a by-reference struct for the
not-always-used parameters, that then also could contain this context.

Greetings,

Andres Freund
On 1/17/23 19:46, Andres Freund wrote:
> Hi,
>
> On 2023-01-17 19:00:02 +0100, Tomas Vondra wrote:
>> On 1/17/23 18:02, Andres Freund wrote:
>>> On 2023-01-17 15:52:07 +0100, Tomas Vondra wrote:
>>>> That also does not have issues with timestamp "rounding" - considering
>>>> e.g. sample rate 1000Hz, that's 1ms between samples. And it's quite
>>>> possible the node completes within 1ms, in which case
>>>>
>>>>   (now - last_sample)
>>>>
>>>> ends up being 0 (assuming I correctly understand the code).
>>>
>>> That part should be counting in nanoseconds, I think? Unless I misunderstand
>>> something?
>>
>> The higher precision does not help, because both values come from the
>> *sampled* timestamp (i.e. the one updated from the signal handler). So
>> if the node happens to execute between two signals, the values are going
>> to be the same, and the difference is 0.
>
> In that case there simply wasn't any sample for the node, and a non-timestamp
> based sample counter wouldn't do anything different?

Yeah, you're right.

> If you're worried about the case where a timer does fire during execution of
> the node, but exactly once, that should provide a difference between the last
> sampled timestamp and the current time. It'll attribute a bit too much to the
> in-progress nodes, but well, that's sampling for you.
>
> I think a "hybrid" explain mode might be worth thinking about. Use the
> "current" sampling method for the first execution of a node, and for the first
> few milliseconds of a query (or perhaps the first few timestamp
> acquisitions). That provides an accurate explain analyze for short queries,
> without a significant slowdown. Then switch to sampling, which provides decent
> attribution for a bit longer running queries.

Yeah, this is essentially the sampling I imagined when I first read the
subject of this thread. It samples which node executions to measure (and then
measures those accurately), while these patches sample timestamps.

>>>>> I've been thinking that we should consider making more of the instrumentation
>>>>> code work like that. The amount of work we're doing in InstrStart/StopNode()
>>>>> has steadily risen. When buffer usage and WAL usage are enabled, we're
>>>>> executing over 1000 instructions! And each single Instrumentation node is ~450
>>>>> bytes, to a good degree due to having 2 BufUsage and 2 WalUsage structs
>>>>> embedded.
>>>>>
>>>>> If we instead have InstrStartNode() set up a global pointer to the
>>>>> Instrumentation node, we can make the instrumentation code modify both the
>>>>> "global" counters (pgBufferUsage, pgWalUsage) and, if set,
>>>>> current_instr->{pgBufferUsage, pgWalUsage}. That'll require some larger
>>>>> changes - right now nodes "automatically" include the IO/WAL incurred in child
>>>>> nodes, but that's just a small bit of additional summing-up to be done during
>>>>> EXPLAIN.
>>>>
>>>> That's certainly one way to implement that. I wonder if we could make
>>>> that work without the global pointer, but I can't think of any.
>>>
>>> I don't see a realistic way at least. We could pass down an
>>> "InstrumentationContext" through everything that needs to do IO and WAL. But
>>> that seems infeasible at this point.
>>
>> Why infeasible?
>
> Primarily the scale of the change. We'd have to pass down the context into all
> table/index AM functions. And into a lot of bufmgr.c, xlog.c functions,
> which'd require their callers to have access to the context. That's hundreds
> if not thousands of places.
>
> Adding that many function parameters might turn out to be noticeable runtime
> wise, due to increased register movement. I think for a number of the places
> where we currently don't, we ought to use a by-reference struct for the
> not-always-used parameters, that then also could contain this context.

OK, I hadn't realized we'd have to pass it to that many places.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, 17 Jan 2023 at 14:52, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> On 1/17/23 19:46, Andres Freund wrote:
>> I think a "hybrid" explain mode might be worth thinking about. Use the
>> "current" sampling method for the first execution of a node, and for the first
>> few milliseconds of a query (or perhaps the first few timestamp
>> acquisitions). That provides an accurate explain analyze for short queries,
>> without a significant slowdown. Then switch to sampling, which provides decent
>> attribution for a bit longer running queries.
>
> Yeah, this is essentially the sampling I imagined when I first read the
> subject of this thread. It samples which node executions to measure (and
> then measures those accurately), while these patches sample timestamps.

That sounds interesting. Fwiw my first thought would be to implement it a bit
differently. Always have a timer running sampling right from the start, but
also if there are less than, say, 1000 samples for a node then measure the
actual start/finish time. So for any given node once you've hit enough samples
to get a decent estimate you stop checking the time.

That way any fast or rarely called nodes still have accurate measurements even
if they get few samples and any long or frequently called nodes stop getting
timestamps and just use timer counts.

--
greg
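For illustration, a rough sketch of that per-node cutover (all struct, field and function names below are hypothetical, and the bookkeeping that makes the timer handler attribute ticks to the currently executing node is omitted):

#include "postgres.h"
#include "portability/instr_time.h"

#define ACCURATE_SAMPLES_PER_NODE 1000  /* hypothetical cutover threshold */

/* Hypothetical per-node state combining accurate timing and tick counting. */
typedef struct HybridInstr
{
    int64       nsamples;       /* timer ticks attributed to this node so far */
    bool        timing;         /* did we take a timestamp in StartNode? */
    instr_time  starttime;      /* start of the current iteration */
    instr_time  counter;        /* accurately measured portion */
    int64       sampled_ticks;  /* ticks counted once timestamps stop */
} HybridInstr;

static inline void
HybridStartNode(HybridInstr *instr)
{
    /* Only pay for a clock read until we have enough samples for this node. */
    instr->timing = (instr->nsamples < ACCURATE_SAMPLES_PER_NODE);
    if (instr->timing)
        INSTR_TIME_SET_CURRENT(instr->starttime);
}

static inline void
HybridStopNode(HybridInstr *instr)
{
    if (instr->timing)
    {
        instr_time  endtime;

        INSTR_TIME_SET_CURRENT(endtime);
        INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->starttime);
    }
    /* nsamples and sampled_ticks are advanced from the timer handler instead */
}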
This thread has been stale since January with no movement at all during the March CF, and according to the CFBot it stopped building at all ~ 14 weeks ago. I'm marking this returned with feedback, it can be resubmitted for a future CF if someone decides to pick it up. -- Daniel Gustafsson