Re: That EXPLAIN ANALYZE patch still needs work - Mailing list pgsql-hackers
From: Tom Lane
Subject: Re: That EXPLAIN ANALYZE patch still needs work
Msg-id: 23740.1149632250@sss.pgh.pa.us
In response to: Re: That EXPLAIN ANALYZE patch still needs work (Martijn van Oosterhout <kleptog@svana.org>)
List: pgsql-hackers
Martijn van Oosterhout <kleptog@svana.org> writes:
> This doesn't make any sense at all. How can a sampling run that only
> sampled 7% of the actual tuples end up with an actual measured time
> that's more than 50% of the actual final runtime?

AFAICS, the only conclusion is that the sampled executions are in fact
taking longer than supposedly-equivalent unsampled ones, and by a pretty
good percentage too. I'm growing unsure again about the mechanism
responsible for that, however.

> Can you get an estimate of the actual overhead (ie compare wall clock
> time of EXPLAIN ANALYZE vs actual query). The cost of gettimeofday() is
> on the order of a few hundred cycles; I'm not sure about the cost of
> TLB flushes (that's probably highly architecture dependent).

Here are some examples. Keep in mind I've already determined that
gettimeofday() takes about 3 usec on this hardware ...

regression=# explain analyze select count(*) from (select * from tenk1 a join tenk1 b on a.unique1 = b.unique2 offset 0) ss;
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2609.00..2609.01 rows=1 width=0) (actual time=797.412..797.416 rows=1 loops=1)
   ->  Limit  (cost=825.00..2484.00 rows=10000 width=488) (actual time=208.208..2576.528 rows=10000 loops=1)
         ->  Hash Join  (cost=825.00..2484.00 rows=10000 width=488) (actual time=208.190..2082.577 rows=10000 loops=1)
               Hash Cond: (a.unique1 = b.unique2)
               ->  Seq Scan on tenk1 a  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.082..3.718 rows=10000 loops=1)
               ->  Hash  (cost=458.00..458.00 rows=10000 width=244) (actual time=207.933..207.933 rows=10000 loops=1)
                     ->  Seq Scan on tenk1 b  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.017..3.583 rows=10000 loops=1)
 Total runtime: 805.036 ms
(8 rows)

Time: 816.463 ms

regression=# select count(*) from (select * from tenk1 a join tenk1 b on a.unique1 = b.unique2 offset 0) ss;
 count
-------
 10000
(1 row)
Time: 816.970 ms

The actual elapsed time for EXPLAIN ANALYZE seems to jump around quite a
bit, probably because of the random variation we're using in the sampling
interval. This particular combination was unusually close. But in any
case, the *actual* overhead of EXPLAIN ANALYZE is clearly pretty small
here; the problem is that we're incorrectly extrapolating the measured
runtime to the unmeasured executions.

What's especially interesting is that the excess time doesn't seem to
show up if I form the query in a way that doesn't require pushing as much
data around:

regression=# explain analyze select count(*) from (select * from tenk1 a join tenk1 b on a.unique1 = b.unique2) ss;
                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1341.00..1341.01 rows=1 width=0) (actual time=212.313..212.317 rows=1 loops=1)
   ->  Hash Join  (cost=483.00..1316.00 rows=10000 width=0) (actual time=88.061..160.886 rows=10000 loops=1)
         Hash Cond: (a.unique1 = b.unique2)
         ->  Seq Scan on tenk1 a  (cost=0.00..458.00 rows=10000 width=4) (actual time=0.071..4.068 rows=10000 loops=1)
         ->  Hash  (cost=458.00..458.00 rows=10000 width=4) (actual time=87.862..87.862 rows=10000 loops=1)
               ->  Seq Scan on tenk1 b  (cost=0.00..458.00 rows=10000 width=4) (actual time=0.031..4.780 rows=10000 loops=1)
 Total runtime: 221.022 ms
(7 rows)

Time: 229.377 ms

regression=# select count(*) from (select * from tenk1 a join tenk1 b on a.unique1 = b.unique2) ss;
 count
-------
 10000
(1 row)

Time: 202.531 ms
regression=#

(Without the OFFSET 0, the planner flattens the subquery and discovers
that it doesn't actually need to fetch any of the non-join-key table
columns.)

Note that the only plan nodes showing whacked-out timings are the ones
returning wide tuples (large "width" values). I'm not entirely sure what
to make of this. It could be interpreted as evidence for my theory that
TLB reloads during userspace data access are the problem.
But I'm getting a bit disenchanted with that theory after running the
same test case in 8.1:

regression=# explain analyze select count(*) from (select * from tenk1 a join tenk1 b on a.unique1 = b.unique2 offset 0) ss;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2609.00..2609.01 rows=1 width=0) (actual time=1033.866..1033.870 rows=1 loops=1)
   ->  Limit  (cost=825.00..2484.00 rows=10000 width=488) (actual time=330.961..989.272 rows=10000 loops=1)
         ->  Hash Join  (cost=825.00..2484.00 rows=10000 width=488) (actual time=330.946..920.398 rows=10000 loops=1)
               Hash Cond: ("outer".unique1 = "inner".unique2)
               ->  Seq Scan on tenk1 a  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.050..66.861 rows=10000 loops=1)
               ->  Hash  (cost=458.00..458.00 rows=10000 width=244) (actual time=330.745..330.745 rows=10000 loops=1)
                     ->  Seq Scan on tenk1 b  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.010..65.890 rows=10000 loops=1)
 Total runtime: 1041.293 ms
(8 rows)

Time: 1051.594 ms

regression=# select count(*) from (select * from tenk1 a join tenk1 b on a.unique1 = b.unique2 offset 0) ss;
 count
-------
 10000
(1 row)

Time: 870.348 ms

Both runtimes are pretty repeatable. That says that on this example,
where we have sampled 40000 node executions (remember Hash is only
called once), the overhead of EXPLAIN ANALYZE is about 4.5 usec per
node, which is not very far from twice the gettimeofday() time; it
certainly doesn't show any evidence of a huge additional overhead due
to TLB flushes.

One thing I was wondering about was whether the samplings of upper nodes
might be correlating to a greater-than-random extent with the samplings
of lower nodes. That would result in the observed problem of
overestimating upper nodes. It doesn't explain the apparent dependence
on targetlist width, though. So I'm confused again.

> To be honest, I wonder about caching effects, but for the disks.
All of the above are repeated executions that should be fully cached in
RAM.

			regards, tom lane