Thread: Optimizer improvements: to do or not to do?
I intend to play with some optimizer aspects. Just for fun. I'm a novice in DBMS development so I cannot promise any usable results, but if it can be useful even as yet another failed attempt, I will try.

That's what I want to do:
1. Replace the not-very-useful indexCorrelation with indexClustering.
2. Consider caching of the inner table in a nested loops join when estimating the total cost of the join.

More details:
1. During analyze we have sample rows. For every N-th sample row we can scan the indexes with a qual like 'value >= index_first_column' and fetch the first N row TIDs. Estimating the count of fetched heap pages is not hard. To get the index clustering value, just divide the page count by the sample row count.
2. This is much harder and may be impossible for me altogether. The main ideas:
- Split page-fetch cost and CPU cost into different variables and don't sum them before join estimation.
- Final path cost estimation should be done in the join cost estimation and take into account the number of inner table accesses (= K). CPU cost is directly proportional to K, but page fetches can be estimated by the Mackert and Lohman formula (sketched below) using the total tuple count (K * inner_table_selectivity * inner_table_total_tuples).

Any thoughts?
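For reference, a minimal, self-contained sketch of the Mackert and Lohman page-fetch approximation referred to above, with T = pages in the inner table, Ns = total tuples fetched across all K inner accesses, and b = available buffer pages. The function name, parameters, and the usage figures in main() are illustrative assumptions, not the actual planner API:

#include <stdio.h>

/*
 * Mackert-Lohman approximation of the number of page fetches (including
 * refetches once the cache overflows) when Ns tuples are fetched from a
 * table of T pages through an LRU cache of b pages.  Sketch only.
 */
static double
ml_pages_fetched(double Ns, double T, double b)
{
    if (T <= b)
    {
        double pf = (2.0 * T * Ns) / (2.0 * T + Ns);

        return pf < T ? pf : T;    /* whole table fits in cache */
    }
    else
    {
        double lim = (2.0 * T * b) / (2.0 * T - b);

        if (Ns <= lim)
            return (2.0 * T * Ns) / (2.0 * T + Ns);
        return b + (Ns - lim) * (T - b) / T;
    }
}

int
main(void)
{
    /* hypothetical inner table: 10000 pages, 1e6 tuples, 0.1% selectivity,
     * probed K = 5000 times by the outer scan, 2000 cached buffer pages */
    double Ns = 5000 * 0.001 * 1e6;   /* K * selectivity * total tuples */

    printf("estimated page fetches: %.0f\n",
           ml_pages_fetched(Ns, 10000.0, 2000.0));
    return 0;
}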
On Mon, 2006-09-11 at 06:20 -0700, Say42 wrote:
> I intend to play with some optimizer aspects. Just for fun.

Cool. If you think it's fun (it is), you're half way there.

> I'm a novice in DBMS development so I cannot promise any usable results,
> but if it can be useful even as yet another failed attempt, I will try.

This type of work is 90% analysis, 10% coding. You'll need to do a lot of
investigation, lots of discussion and listening.

> That's what I want to do:
> 1. Replace the not-very-useful indexCorrelation with indexClustering.

An opinion such as "not very useful" isn't considered sufficient
explanation or justification for a change around here.

> 2. Consider caching of the inner table in a nested loops join when
> estimating the total cost of the join.
>
> More details:
> 1. During analyze we have sample rows. For every N-th sample row we can
> scan the indexes with a qual like 'value >= index_first_column' and fetch
> the first N row TIDs. Estimating the count of fetched heap pages is not
> hard. To get the index clustering value, just divide the page count by
> the sample row count.
> 2. This is much harder and may be impossible for me altogether. The main
> ideas:
> - Split page-fetch cost and CPU cost into different variables and don't
> sum them before join estimation.
> - Final path cost estimation should be done in the join cost estimation
> and take into account the number of inner table accesses (= K). CPU cost
> is directly proportional to K, but page fetches can be estimated by the
> Mackert and Lohman formula using the total tuple count (K *
> inner_table_selectivity * inner_table_total_tuples).

I'd work on one thing at a time and go into it deeply.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
Simon Riggs <simon@2ndquadrant.com> writes:

>> That's what I want to do:
>> 1. Replace the not-very-useful indexCorrelation with indexClustering.
>
> An opinion such as "not very useful" isn't considered sufficient
> explanation or justification for a change around here.

There's been some previous discussion about how "correlation" was not
really what we wanted to be measuring. But that discussion was in regards
to cross-column "correlation". In that case we're trying to predict how
selective a clause will be. If we read x% of the table due to a restriction
on X, what percentage of the values of Y will be represented?

In this case I think we do need to know correlation or something like it.
That's because what we're trying to predict is how close to sequential the
i/o accesses will be. If there's no correlation between index order and
disk order then they'll be random. If they're highly correlated then
accesses will be close to sequential.

It's possible there's some sort of "block-wise correlated" measure which
would be even better for our needs. We don't care if all the high values
are towards the start and low values towards the end as long as each
section is in order, for example.

It's also possible that we could use something like what you describe to
predict how many physical i/os will happen altogether. If the table is
highly clustered but disordered then the i/o will be random access but the
cache will be more effective than if the table is highly correlated but not
clustered (though it would take a large table to make that possible, I
think).

In short, I think what's needed is someone to review a lot of different
stats metrics for correlation and clustering and do some analysis of how
each would be useful for cost modelling.

--
Gregory Stark
EnterpriseDB   http://www.enterprisedb.com
Simon Riggs wrote:
> This type of work is 90% analysis, 10% coding. You'll need to do a lot
> of investigation, lots of discussion and listening.

I absolutely agree with you and I am not about to rush into coding right
now. First of all I'm going to dig a lot in the PG sources, READMEs and so
on. It's a good school of coding and of understanding DBMS internals.

>> That's what I want to do:
>> 1. Replace the not-very-useful indexCorrelation with indexClustering.
>
> An opinion such as "not very useful" isn't considered sufficient
> explanation or justification for a change around here.

Sometimes indexCorrelation is even misleading. There are many examples of
overestimation of index scan cost (data well clustered but not ordered, so
correlation is low) and some cases of underestimation, when tuples look
well ordered with a high degree of correlation but the index scan actually
causes random page fetches (1-3-2-4-6-5, for example; on a server without
RAID it is VERY slow, 25 times slower than a bitmap index scan). If we have
a dedicated clustering measure we can estimate the page count more
precisely. The next step could be to introduce 'ordering' as a measure of
how sequential the page accesses are. Without the 'ordering', all we can do
is assume that pages are fetched in random order.

Anyhow, if index access cost is overestimated we can set
random_page_cost=2. (Is that true in a production database with a smart
RAID?) Moreover, I think the problem is more complex. With the assumption
that index access is always random we run into another problem:
overestimation of the master table index scan. If the master table is
small enough, PG can choose a seq scan instead of an index scan even if the
latter is actually much cheaper because of caching. That is why caching
should be taken into account during join cost calculation.

>> 2. Consider caching of the inner table in a nested loops join when
>> estimating the total cost of the join.
>
> I'd work on one thing at a time and go into it deeply.

Good news. So I'm very interested in what you think about my ideas. Are
they wrong or too naive?
On Tuesday, 12 September 2006 12:48, Say42 wrote:
> That is why caching should be taken into account during join cost
> calculation.

If you know of a more effective way to do that beyond the
effective_cache_size parameter that we have now, let us know.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/
Gregory Stark <stark@enterprisedb.com> writes:
> It's possible there's some sort of "block-wise correlated" measure
> which would be even better for our needs.

Actually, it seems obvious to me that the correlation measure ought to
ignore within-block ordering, but right now it does not. OTOH it's not
clear how important that is, as on a decent-size table you'll probably not
have more than one sample row in a block anyway.

			regards, tom lane
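To make the "ignore within-block ordering" idea concrete, here is a rough, self-contained sketch that correlates each sampled value's rank with only its heap block number, never consulting the position inside the block. The Sample representation and everything else here is invented for illustration and is not the actual ANALYZE code:

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* One sampled row: column value and the heap block it came from.
 * Invented representation, for illustration only. */
typedef struct
{
    double value;
    double block;
} Sample;

static int
cmp_by_value(const void *a, const void *b)
{
    const Sample *sa = a, *sb = b;

    return (sa->value > sb->value) - (sa->value < sb->value);
}

/* Pearson correlation between a value's rank in sorted order and its
 * block number; the item's offset within the block is ignored. */
static double
block_correlation(Sample *s, int n)
{
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    double cov, var;
    int i;

    qsort(s, n, sizeof(Sample), cmp_by_value);
    for (i = 0; i < n; i++)
    {
        double x = (double) i;  /* rank in value order */
        double y = s[i].block;  /* physical block only */

        sx += x;
        sy += y;
        sxx += x * x;
        syy += y * y;
        sxy += x * y;
    }
    cov = n * sxy - sx * sy;
    var = sqrt((n * sxx - sx * sx) * (n * syy - sy * sy));
    return var > 0 ? cov / var : 0.0;
}

int
main(void)
{
    /* values scrambled inside each block, but the blocks themselves
     * are in value order */
    Sample s[] = {{2, 0}, {1, 0}, {3, 0}, {5, 1}, {4, 1}, {6, 1}};

    printf("block-level correlation: %.2f\n", block_correlation(s, 6));
    return 0;
}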
Peter Eisentraut wrote:
> If you know of a more effective way to do that beyond the
> effective_cache_size parameter that we have now, let us know.

I don't know of a better way, and that is not my goal at all. I am thinking
about more accurate cost estimation of nested loop joins and subqueries.
A usual case in a data request is joining a detail table and some master
tables into a single relation. Often the master tables are small and after
some nested loops iterations they are well (perhaps wholly) cached. Cost
estimation of those tables' access paths doesn't account for such caching
and causes overestimation. In some cases this can lead to choosing a plan
that is not the best one.

An example from real life. The following request returns the count of
national calls from the call registration table.

select count(*) from conn.conn20060803 c
where exists (select code from trunk_codes tc
              where c.bnum >= tc.code and c.bnum like tc.code || '%'
              order by tc.code desc limit 1)

enable_seqscan = off:

"Aggregate  (cost=103185258.68..103185258.69 rows=1 width=0) (actual time=13385.674..13385.676 rows=1 loops=1)"
"  ->  Seq Scan on conn20060803 c  (cost=100000000.00..103184640.52 rows=247264 width=0) (actual time=0.409..13307.254 rows=38739 loops=1)"
"        Filter: (subplan)"
"        SubPlan"
"          ->  Limit  (cost=0.00..6.42 rows=1 width=10) (actual time=0.020..0.020 rows=0 loops=494527)"
"                ->  Index Scan Backward using belg_mobile_pkey on belg_mobile tc  (cost=0.00..6.42 rows=1 width=10) (actual time=0.012..0.012 rows=0 loops=494527)"
"                      Index Cond: (($0)::text >= (code)::text)"
"                      Filter: (($0)::text ~~ ((code)::text || '%'::text))"
"Total runtime: 13385.808 ms"

enable_seqscan = on:

"Aggregate  (cost=1101623.47..1101623.48 rows=1 width=0) (actual time=63724.508..63724.509 rows=1 loops=1)"
"  ->  Seq Scan on conn20060803 c  (cost=0.00..1101005.30 rows=247264 width=0) (actual time=2.244..63640.413 rows=38739 loops=1)"
"        Filter: (subplan)"
"        SubPlan"
"          ->  Limit  (cost=2.20..2.20 rows=1 width=10) (actual time=0.121..0.121 rows=0 loops=494527)"
"                ->  Sort  (cost=2.20..2.20 rows=1 width=10) (actual time=0.114..0.114 rows=0 loops=494527)"
"                      Sort Key: code"
"                      ->  Seq Scan on belg_mobile tc  (cost=0.00..2.19 rows=1 width=10) (actual time=0.096..0.099 rows=0 loops=494527)"
"                            Filter: ((($0)::text >= (code)::text) AND (($0)::text ~~ ((code)::text || '%'::text)))"
"Total runtime: 63724.630 ms"
"Say42" <andrews42@yandex.ru> writes: > Usual case in data request is a joining detail and some master tables > into a single relation. Optimizing on the basis of only one example is seldom a good idea... and I think you are falling into that trap by supposing that there is a "usual case". regards, tom lane
Say42 wrote:
> select count(*) from conn.conn20060803 c
> where exists (select code from belg_mobile tc ...

Correction: replace 'trunk_codes' with 'belg_mobile'.
Tom Lane wrote:
> Optimizing on the basis of only one example is seldom a good idea... and
> I think you are falling into that trap by supposing that there is a
> "usual case".

Perhaps I am wrong, but I assume normalization is a usual case, and small
master (parent) tables are not very rare either. Yes, my example is unusual
but it is _real_ and demonstrates an inaccuracy in the PG optimizer. Why
don't we make the PG optimizer closer to reality if we can? Is it so
needless, or am I making a mountain out of a molehill?
Say42 wrote:
> Perhaps I am wrong, but I assume normalization is a usual case, and small
> master (parent) tables are not very rare either. Yes, my example is
> unusual but it is _real_ and demonstrates an inaccuracy in the PG
> optimizer. Why don't we make the PG optimizer closer to reality if we
> can? Is it so needless, or am I making a mountain out of a molehill?

All you have shown so far is that one particular query runs faster on your
machine when sequential scans are turned off. That is certainly a problem
that is worth addressing. But you haven't offered any analysis about the
cause of this problem, so any speculation about normalization, usual cases,
caching effects and so on is unfounded and premature.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/
Peter Eisentraut wrote:
> But you haven't offered any analysis about the cause of this problem, so
> any speculation about normalization, usual cases, caching effects and so
> on is unfounded and premature.

Ok. My previous message was a bit pompous and unfounded. Sorry. Below I'll
try to explain what I mean when I speak about the caching effect. Let's
take my previous example (I repost the query and some lines from 'explain'
here for convenience):

select count(*) from conn.conn20060803 c
where exists (select code from belg_mobile tc
              where c.bnum >= tc.code and c.bnum like tc.code || '%'
              order by tc.code desc limit 1)

Index Scan Backward using belg_mobile_pkey on belg_mobile tc
  (cost=0.00..6.42 rows=1 width=10) (actual time=0.012..0.012 rows=0 loops=494527)

Seq Scan on belg_mobile tc
  (cost=0.00..2.19 rows=1 width=10) (actual time=0.096..0.099 rows=0 loops=494527)

belg_mobile is very small (68 rows, 1 heap page) and has a PK on the code
column (2 index pages). indexCorrelation is equal to 0.0445 and has almost
no effect on the cost estimation result.

PG cost estimation (as far as I know, of course):

Index scan cost = 2 (index pages) + 1 (heap page) * 4 (random_page_cost)
  + (0.0025 (cpu_operator_cost) * 3 (# ops) + 0.001 (cpu_index_tuple_cost)
     + 0.01 (cpu_tuple_cost)) * 68 (record count) * 0.5 (selectivity of subquery)
  ~ 6 (pages fetch cost) + 0.42 (cpu cost) = 6.42

Seq scan cost = 1 (heap page)
  + (0.0025 (cpu_operator_cost) * 3 (# ops) + 0.01 (cpu_tuple_cost)) * 68 (record count)
  = 1 (pages fetch cost) + 1.19 (cpu cost) = 2.19

The estimation is OK if we touch the belg_mobile table only once. In the
subquery we do it many times. After the first iteration of the subquery all
of belg_mobile's heap and index pages are in the cache, and the cost per
iteration should be estimated with these formulas:

Index scan cost = (6 (pages fetch cost) + 0.42 (cpu cost) * 500K (conn
  table row count)) / 500K ~ 0.42

Seq scan cost = (1 (pages fetch cost) + 1.19 (cpu cost) * 500K (conn table
  row count)) / 500K ~ 1.19

The index scan is actually cheaper because less than one tenth of the conn
rows have matching codes in the belg_mobile table.

That's what I want to say. I am not a veteran DBMS user so I cannot gauge
the importance of this cost inaccuracy as a whole. I hope you will help me
look at the problem (?) more widely than I can at the moment.
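Restating the amortization above as a tiny sketch (not PostgreSQL code; it just plugs in the page and CPU figures from the estimates in the message): the page-fetch component is paid roughly once because the small table stays cached, while the CPU component is paid on every iteration of the subplan.

#include <stdio.h>

/* Amortized per-iteration cost of the inner plan when its pages stay
 * cached after the first of `loops` iterations.  Illustrative only. */
static double
amortized_cost(double page_cost, double cpu_cost, double loops)
{
    return (page_cost + cpu_cost * loops) / loops;
}

int
main(void)
{
    double loops = 494527.0;    /* subplan executions, from the explain output */

    printf("index scan: ~%.2f per iteration\n", amortized_cost(6.0, 0.42, loops));
    printf("seq scan:   ~%.2f per iteration\n", amortized_cost(1.0, 1.19, loops));
    return 0;
}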
"Say42" <andrews42@yandex.ru> writes: > ... Let's take my pervious example (I repost query and some lines > from 'explain' here for convenience): > select count(*) from conn.conn20060803 c where > exists (select code from belg_mobile tc > where c.bnum >= tc.code and c.bnum like tc.code || '%' > order by tc.code desc limit 1) I'm having a hard time getting excited about improving this query when it's so badly coded in the first place. What's an ORDER BY doing in an EXISTS subquery? The LIMIT is unnecessary too. And the inner WHERE says nothing so much as "I don't know how to design a database" :-(. If we're going to look at specific examples we should at least look at examples that are representative of typical good practice. It is true that EXISTS() subqueries are planned independently without any idea of how often they might get re-executed. This would be good to fix but I don't see any clear way to do it --- at the time we are processing the outer WHERE, we don't have enough context to judge how many times a particular clause might be evaluated. (Yeah, in this case it's pretty obvious that it'll be executed once per conn20060803 row, but in join situations, or even just with additional outer WHERE clauses, it's not nearly so obvious.) regards, tom lane
Simon Riggs wrote:
> On Mon, 2006-09-11 at 06:20 -0700, Say42 wrote:
>> That's what I want to do:
>> 1. Replace the not-very-useful indexCorrelation with indexClustering.
>
> An opinion such as "not very useful" isn't considered sufficient
> explanation or justification for a change around here.

"Not sufficient for some types of data" would have been more fair. I
speculate that a new additional stat of "average # of unique values for a
column within a block" would go a long way toward helping my worst
queries.

It's common here for queries to vastly overestimate the number of pages
that would need to be read, because postgresql's guess at the correlation
is practically 0 despite the fact that the distinct values for any given
column are closely packed on a few pages.

Our biggest tables (180GB or so) are mostly spatial data with columns like
"City", "State", "Zip", "County", "Street", "School District", "Police
Beat", "lat/long", etc., and we cluster the table on zip,street. Note that
practically all the rows for any single value of any of the columns will
lie in the same few blocks. However, the calculated "correlation" is low
because the total ordering of the other values doesn't match that of zip
codes. This makes the optimizer vastly overestimate the cost of index
scans, because it guesses that most of the table will need to be read,
even though in reality just a few pages are needed.

If someone does look at the correlation calculations, I hope this type of
data gets considered as well. I speculate that a new stat of "average # of
unique values for a column within a block" could be useful here in
addition to correlation. For almost all the columns in my big table, this
stat would be 1 or 2, which I think would be a useful hint that despite a
low "correlation", the distinct values are indeed packed together in
blocks. That way the optimizer can see that a smaller number of pages
would need to be accessed than correlation alone would suggest.

Does this make sense, or am I missing something?
Ron Mayer <rm_pg@cheapcomplexdevices.com> writes: > It's common here for queries to vastly overestimate the > number of pages that would need to be read because > postgresql's guess at the correlation being practically 0 > despite the fact that the distinct values for any given > column are closely packed on a few pages. I think we need a serious statistics jock to pipe up with some standard metrics that do what we need. Otherwise we'll never have a solid footing for the predictions we make and will never know how much we can trust them. That said I'm now going to do exactly what I just said we should stop doing and brain storm about an ad-hoc metric that might help: I wonder if what we need is something like: sort the sampled values by value and count up the average number of distinct blocks per value. That might let us predict how many pages a fetch of a specific value would retrieve. Or perhaps we need a second histogram where the quantities are of distinct pages rather than total records. We might also need a separate "average number of n-block spans per value" metric to predict how sequential the i/o will be in addition to how many pages will be fetched. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
On Sep 13, 2006, at 14:44, Gregory Stark wrote:
> I think we need a serious statistics jock to pipe up with some standard
> metrics that do what we need. Otherwise we'll never have a solid footing
> for the predictions we make and will never know how much we can trust
> them.
>
> That said, I'm now going to do exactly what I just said we should stop
> doing and brainstorm about an ad-hoc metric that might help:
>
> I wonder if what we need is something like: sort the sampled values by
> value and count up the average number of distinct blocks per value. That
> might let us predict how many pages a fetch of a specific value would
> retrieve. Or perhaps we need a second histogram where the quantities are
> of distinct pages rather than total records.
>
> We might also need a separate "average number of n-block spans per value"
> metric to predict how sequential the i/o will be in addition to how many
> pages will be fetched.

Currently, statistics are only collected during an "ANALYZE". Why aren't
statistics collected during actual query runs, such as seq scans? One could
turn such a beast off in order to get repeatable, deterministic optimizer
results.

-M
Gregory Stark wrote:
> Ron Mayer <rm_pg@cheapcomplexdevices.com> writes:
>> ...vastly overestimate the number of pages ... because postgresql's
>> guess at the correlation being practically 0 despite the fact that the
>> distinct values for any given column are closely packed on a few pages.
>
> I think we need a serious statistics jock to pipe up with some standard
> metrics that do what we need. Otherwise we'll never have a solid footing
> for the predictions we make and will never know how much we can trust
> them.

Do we know if any such people participate/lurk on this list, or if the
conversation should go elsewhere?

> That said, I'm now going to do exactly what I just said we should stop
> doing and brainstorm about an ad-hoc metric that might help:
>
> I wonder if what we need is something like: sort the sampled values by
> value and count up the average number of distinct blocks per value. That
> might let us predict how many pages a fetch of a specific value would
> retrieve. Or perhaps we need a second histogram where the quantities are
> of distinct pages rather than total records.

Either of these sounds like it might be an improvement over correlation
itself for estimating the number of pages that would need to be read.
Would it be relatively easy or hard for a programmer not too familiar with
the code to experiment with these ideas? Where would be a good place to
look?

> We might also need a separate "average number of n-block spans per value"
> metric to predict how sequential the i/o will be in addition to how many
> pages will be fetched.

I'm wildly guessing that the # of pages itself is a bigger factor than the
sequential/random nature. For example, if I do a query for data from a
particular small city, I'd only need dozens of pages, not many thousands.
OTOH, it'd be neat to know if this were true. Is there any good way to make
something like explain analyze show both the expected and actual # of pages
and # of seeks?
Ron Mayer <rm_pg@cheapcomplexdevices.com> writes:
> I'm wildly guessing that the # of pages itself is a bigger factor than
> the sequential/random nature.

No, they're both important: fetching N pages in sequence is way cheaper
than fetching the same number of pages scattered all over. This is partly
because you reduce seeking at the hardware level, and partly because
sequential reads cue the kernel to do read-ahead, allowing overlap of I/O
and computation.

			regards, tom lane
Ron Mayer wrote:
> Gregory Stark wrote:
>> Ron Mayer <rm_pg@cheapcomplexdevices.com> writes:
>>> ...vastly overestimate the number of pages ... because postgresql's
>>> guess at the correlation being practically 0 despite the fact that the
>>> distinct values for any given column are closely packed on a few pages.
>>
>> I think we need a serious statistics jock to pipe up with some standard
>> metrics that do what we need. Otherwise we'll never have a solid footing
>> for the predictions we make and will never know how much we can trust
>> them.
>
> Do we know if any such people participate/lurk on this list, or if the
> conversation should go elsewhere?

I lurk... I don't know if I'm a 'statistics jock', but I may be valuable if
only I had a better understanding of how the optimizer works. I have been
following this thread with interest, but could really do with a good
pointer to background information beyond what I have read in the main
postgres manual. Does such information exist, and if so, where?

Josh Reich
Joshua Reich <josh@root.net> writes:
> I lurk... I don't know if I'm a 'statistics jock', but I may be valuable
> if only I had a better understanding of how the optimizer works. I have
> been following this thread with interest, but could really do with a
> good pointer to background information beyond what I have read in the
> main postgres manual. Does such information exist, and if so, where?

Well, there's the 20000-foot view here:
http://developer.postgresql.org/pgdocs/postgres/planner-optimizer.html
but after that you have to start reading code. The optimizer README file
may be useful:
http://developer.postgresql.org/cvsweb.cgi/pgsql/src/backend/optimizer/README
but it goes into a lot of details that probably aren't interesting for
your purposes.

Most of the planner is just mechanism associated with generating different
possible plans. The policy that determines which plan is chosen is the
cost-estimation equations, and those are all in costsize.c and selfuncs.c:
http://developer.postgresql.org/cvsweb.cgi/pgsql/src/backend/optimizer/path/costsize.c
http://developer.postgresql.org/cvsweb.cgi/pgsql/src/backend/utils/adt/selfuncs.c

The division between these two files is a bit historical, but roughly
speaking selfuncs.c knows about the behavior of specific WHERE-clause
operators and index access methods, while costsize.c knows about the
behavior of particular plan types. I'd like to think that costsize.c is
well enough commented that you can follow it even without any C knowledge,
but selfuncs.c may be a bit more daunting. Still, the comments are pretty
extensive, and feel free to ask questions on pg-hackers.

			regards, tom lane
Tom Lane wrote:
> I'm having a hard time getting excited about improving this query when
> it's so badly coded in the first place. What's an ORDER BY doing in an
> EXISTS subquery? The LIMIT is unnecessary too. And the inner WHERE says
> nothing so much as "I don't know how to design a database" :-(.

It was a test query which has the same execution plan for belg_mobile (and
the same problem) as the production query below:

select (select max(code) from belg_mobile tc
        where c.bnum >= tc.code and c.bnum like tc.code || '%') as code,
       c.cause, c.ilno, extract(hour from c.datetime) as hour,
       count(*) as cnt, sum(c.dur) as dur
from conn.conn20060803 c
where itgrp = :itgrp
group by 1,2,3,4

It's a simple OLAP query for analyzing the distribution of telephone
traffic over time and trunk codes. 'max(code)' is used to get the most
specific matching code. For example, 84725 and 8472 are both valid codes,
and the number 84725123456 must match 84725, not 8472. The 'c.bnum >=
tc.code' qual significantly reduces the index scan and the execution time.