Thread: osdl-dbt3 run results - puzzled by the execution plans
Our hardware/software configuration:

  kernel:  2.5.74
  distro:  RH7.2
  pgsql:   7.3.3
  CPUs:    8
  MHz:     700.217
  model:   Pentium III (Cascades)
  memory:  8298888 kB
  shmmax:  3705032704

We did several sets of runs (repeating runs with the same database
parameters) and have the following observations:

1. With everything else the same, we did two run sets with a small
effective_cache_size (default=1000) and a large one (655360, i.e. 5GB
or 60% of the 8GB system memory).  It seems to me that the small
effective_cache_size favors the choice of nested loop joins (NLJ)
while the big effective_cache_size is in favor of merge joins (MJ).
We thought the large effective_cache_size should lead us to better
plans, but we found the opposite.

Three plans out of 22 are different.  Two of those plans are worse in
execution time by 2 times and 8 times.  For example, one plan that
included an NLJ ran in 4 seconds, but the other, switching to an MJ,
ran in 32 seconds.  Please refer to the links at the end of this mail
for the query and plans.  Did we miss something, or are improvements
needed for the optimizer?

2. Thanks to all the responses we got from this mailing list, we
decided to use

  SETSEED(0)
  default_statistics_target=1000

to reduce the variation.  We now get exactly the same execution plans
and costs with repeated runs, and that reduced the variation a lot.
However, within the same run set consisting of 6 runs, we see a 2-3%
standard deviation for the run metrics associated with the multiple
stream part of the test (as opposed to the single stream part).  We
would like to reduce the variation to less than 1% so that a 2% change
between two different kernels would be significant.  Is there anything
else we can do?

query:
http://developer.osdl.org/~jenny/11.sql
plan with small effective_cache_size:
http://developer.osdl.org/~jenny/small_effective_cache_size_plan
plan with large effective_cache_size:
http://developer.osdl.org/~jenny/large_effective_cache_size_plan

Thanks,
Jenny
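For readers following along, a minimal sketch of the settings described
above, assuming they are issued per session before each run (they could
equally live in postgresql.conf); the values are the ones quoted in this
mail:

  -- effective_cache_size is measured in 8kB disk pages: 655360 pages ~ 5GB
  SET effective_cache_size = 655360;
  -- higher statistics target; takes effect for columns ANALYZEd afterwards
  SET default_statistics_target = 1000;
  -- fix the random seed so repeated planning is reproducible
  SELECT setseed(0);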
> We thought the large effective_cache_size should lead us to better
> plans.  But we found the opposite.

Maybe it's inappropriate for little old me to jump in here, but the plan
isn't usually that important compared to the actual runtime.  The links
you give show the output of 'explain' but not 'explain analyze', so it's
not clear which plan is actually _faster_.

If you really do have only 8MB of FS cache, then either plan will run
slowly.  If you really do have 5GB of FS cache, then either plan will
run a lot faster.  Why would you deliberately give the planner false
information about this?

PG obviously thinks plan 1 is 'better' when pages have to be fetched
from disk, and plan 2 is 'better' when they don't.  Which is really
better depends on whether those pages do have to be fetched from disk or
not, and PG can only know what you tell it about that, so changing ECS
without actually removing the RAM from the system seems a little
pointless to me...

M
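To illustrate the distinction Matt is drawing, a small sketch using the
DBT-3 tables that appear later in this thread (this is not the actual
11.sql query, just a simplified stand-in on the same schema):

  -- Planner estimates only; the query is never executed:
  EXPLAIN
  SELECT ps_partkey, ps_supplycost, ps_availqty
    FROM partsupp, supplier
   WHERE ps_suppkey = s_suppkey;

  -- Runs the query and reports, for every plan node, the actual row
  -- counts and timings alongside the estimates, so plan and reality
  -- can be compared directly:
  EXPLAIN ANALYZE
  SELECT ps_partkey, ps_supplycost, ps_availqty
    FROM partsupp, supplier
   WHERE ps_suppkey = s_suppkey;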
Thanks for your prompt reply.

On Thu, 2003-09-18 at 16:19, Matt Clark wrote:
> > We thought the large effective_cache_size should lead us to better
> > plans.  But we found the opposite.
>
> Maybe it's inappropriate for little old me to jump in here, but the plan
> isn't usually that important compared to the actual runtime.  The links you
> give show the output of 'explain' but not 'explain analyze', so it's not
> clear which plan is actually _faster_.
>
I put the EXPLAIN ANALYZE output at:
http://developer.osdl.org/~jenny/large_explain_analyze
http://developer.osdl.org/~jenny/small_explain_analyze
The actual execution time is 37 seconds (large) vs 5 seconds (small).
I concluded that the nested loop plan is faster, since we saw it
consistently beat the merge join plan in our runs.

> If you really do have only 8MB of FS cache, then either plan will run
> slowly.  If you really do have 5GB of FS cache, then either plan will run a
> lot faster.  Why would you deliberately give the planner false information
> about this?
>
We did not.  A little history of our runs: when we first started, not
knowing PG well, we just used the default ECS value (1000).  Then we
realized that since we have 8G of RAM, we should set ECS to 655360.
But this leads the optimizer to pick a bad plan.  That is the reason we
posted this message.

> PG obviously thinks plan 1 is 'better' when pages have to be fetched from
> disk, and plan 2 is 'better' when they don't.  Which is really better
> depends on whether those pages do have to be fetched from disk or not, and
> PG can only know what you tell it about that, so changing ECS without
> actually removing the RAM from the system seems a little pointless to me...
>
> M

Regards,
Jenny
Jenny Zhang <jenny@osdl.org> writes:
> ... It seems to me that the small
> effective_cache_size favors the choice of nested loop joins (NLJ)
> while the big effective_cache_size is in favor of merge joins (MJ).

No, I wouldn't think that, because a nestloop plan will involve repeated
fetches of the same tuples whereas a merge join doesn't (at least not
when it sorts its inner input, as this plan does).  Larger cache
improves the odds of a repeated fetch not having to do I/O.  In practice
a larger cache area would also have some effect on access costs for the
sort's temp file, but I don't think the planner's cost model for sorting
takes that into account.

As Matt Clark points out nearby, the real question is whether these
planner estimates have anything to do with reality.  EXPLAIN ANALYZE
results would be far more interesting than plain EXPLAIN.

> However, within the same run set consisting of 6 runs, we see a 2-3%
> standard deviation for the run metrics associated with the multiple
> stream part of the test (as opposed to the single stream part).

<python> Och, laddie, we useta *dream* of 2-3% variation </python>

> We would like to reduce the variation to less than 1% so that a
> 2% change between two different kernels would be significant.

I think this is a pipe dream.  Variation in where the data gets laid
down on your disk drive would alone create more than that kind of delta.
I'm frankly amazed you could get repeatability within 2-3%.

			regards, tom lane
> I put the EXPLAIN ANALYZE output at:
> http://developer.osdl.org/~jenny/large_explain_analyze
> http://developer.osdl.org/~jenny/small_explain_analyze
> The actual execution time is 37 seconds (large) vs 5 seconds (small).

There's an obvious row count misestimation in the 'large' plan:

  ->  Sort  (cost=519.60..520.60 rows=400 width=31)
            (actual time=106.88..143.49 rows=30321 loops=1)

but I'm not good enough at reading these things to tell if that's the
cause of the problem, or if so how to fix it :-(
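If that node really were a statistics problem, the usual knob would be
to raise the statistics target on the columns feeding that Sort (the
supplier/nation join in these plans) and re-ANALYZE.  A sketch, with
the column choice assumed rather than confirmed; note that, as comes
up later in the thread, the rows=400 vs rows=30321 gap may just be an
artifact of how merge joins count rescanned rows, in which case no
statistics change would help:

  -- Hypothetical attempt: finer-grained statistics on the columns
  -- behind that Sort's input, then refresh the planner's estimates.
  ALTER TABLE supplier ALTER COLUMN s_nationkey SET STATISTICS 1000;
  ALTER TABLE nation   ALTER COLUMN n_name      SET STATISTICS 1000;
  ANALYZE supplier;
  ANALYZE nation;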
Tom Lane <tgl@sss.pgh.pa.us> writes:

> I think this is a pipe dream.  Variation in where the data gets laid
> down on your disk drive would alone create more than that kind of delta.
> I'm frankly amazed you could get repeatability within 2-3%.

I think the reason she gets good repeatability is because she's talking
about the aggregate results for a whole test run, not individual
queries.  In theory you could just run the whole test multiple times;
the more times you run it, the lower the variation in the total run
time would be.

Actually, the variation in run time is also a useful statistic, both for
postgres and the kernel.  It might be useful to do multiple complete
runs and keep track of the average and standard deviation of the time
required for each step.

A higher standard deviation implies queries can't be reliably depended
on not to take inordinately long, which can be a problem for some
working models.  For the kernel it could mean latency issues, or it
could mean the swapper or buffer cache was overly aggressive.

-- 
greg
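A sketch of the bookkeeping Greg describes, assuming the per-step
timings are loaded into a hypothetical table (the table and column
names here are made up for illustration; DBT-3 stores its results in
its own format):

  -- Hypothetical table: one row per query execution per run.
  CREATE TABLE run_times (
      run_id    integer,
      query_id  integer,
      seconds   numeric
  );

  -- Average and standard deviation of each query's runtime across runs.
  SELECT query_id,
         avg(seconds)    AS avg_seconds,
         stddev(seconds) AS stddev_seconds
    FROM run_times
   GROUP BY query_id
   ORDER BY query_id;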
On Thu, 18 Sep 2003 15:36:50 -0700, Jenny Zhang <jenny@osdl.org> wrote:
>We thought the large effective_cache_size should lead us to better
>plans.  But we found the opposite.

The common structure of your query plans is:

 Sort
   Sort Key: sum((partsupp.ps_supplycost * partsupp.ps_availqty))
   InitPlan
     ->  Aggregate
           ->  SubPlan
   ->  Aggregate
         Filter: (sum((ps_supplycost * ps_availqty)) > $0)
         ->  Group
               ->  Sort
                     Sort Key: partsupp.ps_partkey
                     ->  SubPlan  (same as above)

where the SubPlan is

 ->  Merge Join  (cost=519.60..99880.05 rows=32068 width=65)
                 (actual time=114.78..17435.28 rows=30400 loops=1)   ctr=5.73
       Merge Cond: ("outer".ps_suppkey = "inner".s_suppkey)
       ->  Index Scan using i_ps_suppkey on partsupp
                 (cost=0.00..96953.31 rows=801712 width=34)
                 (actual time=0.42..14008.92 rows=799361 loops=1)    ctr=6.92
       ->  Sort  (cost=519.60..520.60 rows=400 width=31)
                 (actual time=106.88..143.49 rows=30321 loops=1)     ctr=3.63
             Sort Key: supplier.s_suppkey
             ->  SubSubPlan

for large effective_cache_size and

 ->  Nested Loop  (cost=0.00..130168.30 rows=32068 width=65)
                  (actual time=0.56..1374.41 rows=30400 loops=1)     ctr=94.71
       ->  SubSubPlan
       ->  Index Scan using i_ps_suppkey on partsupp
                  (cost=0.00..323.16 rows=80 width=34)
                  (actual time=0.16..2.98 rows=80 loops=380)         ctr=108.44
             Index Cond: (partsupp.ps_suppkey = "outer".s_suppkey)

for small effective_cache_size.  Both subplans have an almost identical
subsubplan:

 ->  Nested Loop  (cost=0.00..502.31 rows=400 width=31)
                  (actual time=0.23..110.51 rows=380 loops=1)        ctr=4.55
       Join Filter: ("inner".s_nationkey = "outer".n_nationkey)
       ->  Seq Scan on nation  (cost=0.00..1.31 rows=1 width=10)
                  (actual time=0.08..0.14 rows=1 loops=1)            ctr=9.36
             Filter: (n_name = 'ETHIOPIA'::bpchar)
       ->  Seq Scan on supplier  (cost=0.00..376.00 rows=10000 width=21)
                  (actual time=0.10..70.72 rows=10000 loops=1)       ctr=5.32

I have added the ctr (cost:time ratio) for each plan node.  These
values are mostly between 5 and 10, with two notable exceptions:

1)  ->  Sort  (cost=519.60..520.60 rows=400 width=31)
              (actual time=106.88..143.49 rows=30321 loops=1)        ctr=3.63

It has already been noticed by Matt Clark that this is the only plan
node where the row count estimation looks wrong.  However, I don't
believe that this has great influence on the total cost of the plan,
because the ctr is not far from the usual range, and if it were a bit
higher it would only add a few hundred cost units to a branch costing
almost 100000 units.  BTW, I vaguely remember that there is something
strange with the way actual rows are counted inside a merge join.
Look at the branch below this plan node: it shows an actual row count
of 380.

2)  ->  Index Scan using i_ps_suppkey on partsupp
              (cost=0.00..323.16 rows=80 width=34)
              (actual time=0.16..2.98 rows=80 loops=380)             ctr=108.44

Here we have the only plan node where loops > 1, and it is the only
one where the ctr is far off.  The planner computes the cost for one
loop and multiplies it by the number of loops (which it estimates
quite accurately to be 400), thus getting a total cost of ca. 130000.
We have no reason to believe that the single loop cost is very far
from reality (for a *single* index scan), but the planner does not
account for additional index scans hitting pages in the cache that
have been brought in by preceding scans.  This is a known problem;
Tom has mentioned it several times, IIRC.

Now I'm very interested in getting a better understanding of this
problem, so could you please report the results of

.	\d i_ps_suppkey

.	VACUUM VERBOSE ANALYSE partsupp;
	VACUUM VERBOSE ANALYSE supplier;

.	SELECT attname, null_frac, avg_width, n_distinct, correlation
	  FROM pg_stats
	 WHERE tablename = 'partsupp' AND attname IN ('ps_suppkey', ...);

	Please insert other interesting column names for ..., especially
	those contained in i_ps_suppkey, if any.

.	SELECT relname, relpages, reltuples
	  FROM pg_class
	 WHERE relname IN ('partsupp', 'supplier', ...);
	                                             ^^^
	Add relevant index names here.

.	EXPLAIN ANALYSE
	SELECT ps_partkey, ps_supplycost, ps_availqty
	  FROM partsupp, supplier
	 WHERE ps_suppkey = s_suppkey AND s_nationkey = '<youknowit>';

	The idea is to eliminate parts of the plan that are always the
	same.  Omitting nation is possibly too much of a simplification;
	in that case please re-add it.  Do this test for small and large
	effective_cache_size.  Force the use of other join methods by
	setting enable_<joinmethod> to off.  Post all results.

Jenny, I understand that this long message contains more questions
than answers and is not of much help for you.  OTOH your tests might
be very helpful for Postgres development ...

Servus
 Manfred
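A concrete sketch of that last request, assuming the DBT-3 schema from
the plans above and keeping the nation join so the filter value stays
literal; re-run it under both effective_cache_size settings, and swap in
SET enable_mergejoin = off to force the other join method:

  -- Steer the planner away from a nested loop so the same query can be
  -- timed with a merge join (EXPLAIN ANALYZE actually runs the query):
  SET enable_nestloop = off;
  EXPLAIN ANALYZE
  SELECT ps_partkey, ps_supplycost, ps_availqty
    FROM partsupp, supplier, nation
   WHERE ps_suppkey = s_suppkey
     AND s_nationkey = n_nationkey
     AND n_name = 'ETHIOPIA';
  RESET enable_nestloop;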
I posted more results as you requested:

On Fri, 2003-09-19 at 08:08, Manfred Koizar wrote:
> On Thu, 18 Sep 2003 15:36:50 -0700, Jenny Zhang <jenny@osdl.org> wrote:
> >We thought the large effective_cache_size should lead us to better
> >plans.  But we found the opposite.
>
> [...]
>
> Now I'm very interested in getting a better understanding of this
> problem, so could you please report the results of
>
> .	\d i_ps_suppkey
>
http://developer.osdl.org/~jenny/pgsql-optimizer/disc_i_ps_suppkey

> .	VACUUM VERBOSE ANALYSE partsupp;
> 	VACUUM VERBOSE ANALYSE supplier;
>
http://developer.osdl.org/~jenny/pgsql-optimizer/vacuum_verbose_analyze_partsupp
http://developer.osdl.org/~jenny/pgsql-optimizer/vacuum_verbose_analyze_suppler

> .	SELECT attname, null_frac, avg_width, n_distinct, correlation
> 	  FROM pg_stats
> 	 WHERE tablename = 'partsupp' AND attname IN ('ps_suppkey', ...);
>
> 	Please insert other interesting column names for ..., especially
> 	those contained in i_ps_suppkey, if any.
>
I put all the related columns here:
http://developer.osdl.org/~jenny/pgsql-optimizer/info_partsupp_col

> .	SELECT relname, relpages, reltuples
> 	  FROM pg_class
> 	 WHERE relname IN ('partsupp', 'supplier', ...);
> 	                                             ^^^
> 	Add relevant index names here.
>
I put all the related tables here:
http://developer.osdl.org/~jenny/pgsql-optimizer/info_table

> .	EXPLAIN ANALYSE
> 	SELECT ps_partkey, ps_supplycost, ps_availqty
> 	  FROM partsupp, supplier
> 	 WHERE ps_suppkey = s_suppkey AND s_nationkey = '<youknowit>';
>
> 	The idea is to eliminate parts of the plan that are always the
> 	same.  Omitting nation is possibly too much of a simplification;
> 	in that case please re-add it.  Do this test for small and large
> 	effective_cache_size.  Force the use of other join methods by
> 	setting enable_<joinmethod> to off.  Post all results.
>
http://developer.osdl.org/~jenny/pgsql-optimizer/explain_query_mk

> Jenny, I understand that this long message contains more questions
> than answers and is not of much help for you.  OTOH your tests might
> be very helpful for Postgres development ...

Let me know if you need anything else.

Jenny
On Thu, 2003-09-18 at 20:20, Tom Lane wrote:
> Jenny Zhang <jenny@osdl.org> writes:
> > ... It seems to me that the small
> > effective_cache_size favors the choice of nested loop joins (NLJ)
> > while the big effective_cache_size is in favor of merge joins (MJ).
>
> No, I wouldn't think that, because a nestloop plan will involve repeated
> fetches of the same tuples whereas a merge join doesn't (at least not
> when it sorts its inner input, as this plan does).  Larger cache
> improves the odds of a repeated fetch not having to do I/O.  In practice
> a larger cache area would also have some effect on access costs for the
> sort's temp file, but I don't think the planner's cost model for sorting
> takes that into account.
>
I think there is some misunderstanding here.  What I meant to say is:
from the plans we got, the optimizer chooses nested loop joins (NLJ)
with the small effective_cache_size and merge joins (MJ) with the big
effective_cache_size, which we think is not appropriate.  We verified
that sort_mem has no impact on the plans, though it would be nice to
take that into account.

> As Matt Clark points out nearby, the real question is whether these
> planner estimates have anything to do with reality.  EXPLAIN ANALYZE
> results would be far more interesting than plain EXPLAIN.
>
> > However, within the same run set consisting of 6 runs, we see a 2-3%
> > standard deviation for the run metrics associated with the multiple
> > stream part of the test (as opposed to the single stream part).
>
> <python> Och, laddie, we useta *dream* of 2-3% variation </python>
>
BTW, I am a she :-)

> > We would like to reduce the variation to less than 1% so that a
> > 2% change between two different kernels would be significant.
>
> I think this is a pipe dream.  Variation in where the data gets laid
> down on your disk drive would alone create more than that kind of delta.
> I'm frankly amazed you could get repeatability within 2-3%.
>
Greg is right: the repeatability is due to the aggregate results for a
whole test run.  For an individual query, the power test (single
stream) is very consistent, while in the throughput test (multiple
streams) any given query's execution time varies by up to 15% if there
is no swapping.  If we set sort_mem too high and swapping occurs, the
variation is bigger.

Jenny
On Fri, 2003-09-19 at 06:12, Greg Stark wrote:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>
> > I think this is a pipe dream.  Variation in where the data gets laid
> > down on your disk drive would alone create more than that kind of delta.
> > I'm frankly amazed you could get repeatability within 2-3%.
>
> I think the reason she gets good repeatability is because she's talking
> about the aggregate results for a whole test run, not individual
> queries.  In theory you could just run the whole test multiple times;
> the more times you run it, the lower the variation in the total run
> time would be.
>
That is right.  The repeatability is due to the aggregate results for a
whole test run.  For an individual query, the power test (single
stream) is very consistent, while in the throughput test (multiple
streams) any given query's execution time varies by up to 15% if there
is no swapping.  If we set sort_mem too high and swapping occurs, the
variation is bigger.

> Actually, the variation in run time is also a useful statistic, both for
> postgres and the kernel.  It might be useful to do multiple complete
> runs and keep track of the average and standard deviation of the time
> required for each step.
>
I created a page with the execution time (in seconds), average, and
stddev for each query and each step.  The data is collected from 6 dbt3
runs:
http://developer.osdl.org/~jenny/pgsql-optimizer/exetime.html

> A higher standard deviation implies queries can't be reliably depended
> on not to take inordinately long, which can be a problem for some
> working models.  For the kernel it could mean latency issues, or it
> could mean the swapper or buffer cache was overly aggressive.

I agree.  I can think of another reason why the performance varies even
when the swapper and buffer cache are not overly aggressive: since PG
depends on the OS to manage the buffer cache (correct me if I am
wrong), it is up to the OS to decide what to keep in the cache, and the
OS cannot anticipate what is likely to be needed next.

Thanks,
Jenny
On Fri, 19 Sep 2003 11:35:35 -0700, Jenny Zhang <jenny@osdl.org> wrote:
>I posted more results as you requested:

Unfortunately they only confirm what I suspected earlier:

>> 2)  ->  Index Scan using i_ps_suppkey on partsupp
>>               (cost=0.00..323.16 rows=80 width=34)
>>               (actual time=0.16..2.98 rows=80 loops=380)   ctr=108.44

>> the planner does not
>> account for additional index scans hitting pages in the cache that
>> have been brought in by preceding scans.  This is a known problem

  PF1 = estimated number of page fetches for one loop  ~   320
  L   = estimated number of loops                      ~   400
  P   = number of pages in relation                    ~ 21000

Cutting down the number of heap page fetches if PF1 * L > P and
P < effective_cache_size seems like an obvious improvement, but I was
not able to figure out where to make this change.  Maybe it belongs
into costsize.c near

	run_cost += outer_path_rows *
		(inner_path->total_cost - inner_path->startup_cost) *
		joininfactor;

in cost_nestloop(), or it should be pushed into the index cost
estimation functions.  Hackers?

For now you have to keep lying about effective_cache_size to make the
planner overestimate merge joins to compensate for the planner's
overestimation of nested loops.  Sorry for having no better answer.

Servus
 Manfred
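With Manfred's numbers, PF1 * L is roughly 320 * 400 = 128,000 estimated
page fetches against a relation of only about 21,000 pages, so both of
his conditions hold comfortably here: 21,000 8kB pages is roughly 164MB,
far below a 5GB effective_cache_size.  A sketch of how one might eyeball
that from psql (the table and index names come from the plans above; the
~21000-page figure is Manfred's estimate, and the cache comparison is
done by hand):

  SHOW effective_cache_size;
  -- 655360 (8kB pages) in the "large" runs, i.e. about 5GB

  SELECT relname, relpages, reltuples
    FROM pg_class
   WHERE relname IN ('partsupp', 'i_ps_suppkey');
  -- partsupp comes out around 21000 pages (~164MB), so repeated inner
  -- index scans should find most heap pages already cached.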
Manfred Koizar <mkoi-pg@aon.at> writes:
> Cutting down the number of heap page fetches if PF1 * L > P and
> P < effective_cache_size seems like an obvious improvement, but I was
> not able to figure out where to make this change.  Maybe it belongs
> into costsize.c near
>
>	run_cost += outer_path_rows *
>		(inner_path->total_cost - inner_path->startup_cost) *
>		joininfactor;

I've been intending for some time to try to restructure the cost
estimator so that repeated indexscans can be costed more accurately.
Within the context of the heap-fetch-estimating algorithm, I think the
entire execution of a nestloop-with-inner-index-scan could probably be
treated as a single scan.  I'm not sure how we adjust the estimates for
the index-access part, though clearly those are too high as well.

This doesn't seem to be a localized change unfortunately.  Certainly
costsize.c can't do it alone.

			regards, tom lane