Thread: Proposal for new SET variables for optimizer costs
I am about to implement some changes to the planner/optimizer's cost model, following up to the thread on pghackers beginning on 20 Jan. The main conclusion of that thread was that we needed to charge more for a page fetched nonsequentially than for a page fetched sequentially. After further investigation I have concluded that it is also appropriate to include explicit modeling of the cost of evaluation of WHERE clauses. For example, using the regression database and a query like select * from tenk1 where (unique1 = 1 and unique2 = 101) or (unique1 = 2 and unique2 = 102) or (unique1 = 3 and unique2 = 103) or ... 100 OR clauses ... (unique1 = 100 and unique2 = 200); (which is not too implausible for certain automatic query generators), I observe that a sequential scan takes about 6 seconds, vs. less than a second for a similar query with only 10 clauses. That says that the cost of evaluating a WHERE clause this large is far from negligible. The optimizer needs to account for this because different query plans can have a considerable impact on the number of tuples that the WHERE clause is evaluated for --- in this example, if we use indexscans to pull out just the tuples with the right values of 'unique1', then the WHERE clause need only be checked at 100 tuples, not all 10000. I believe it would be reasonable to charge a certain amount per operator or function appearing in the WHERE clause in order to account for this effect. (Currently I see no need to model the cost of evaluating the targetlist expressions. The same expressions should get evaluated for the same tuples no matter what query plan the optimizer picks, so we might as well just leave that cost out of our comparisons.) Also, as was previously mentioned on pghackers, I would like to add SET variables to control enabling/disabling of particular query plan types, so that different plans can be checked with less hassle than restarting psql with a new PGOPTIONS setting. This all leads to the following proposal for redoing the optimizer plan cost SET variables. The variables proposed below would replace COST_HEAP and COST_INDEX, which are poorly named IMHO and are definitely very misleadingly documented at present. (Note that all costs will still be referenced to the cost of a disk page fetch. We will take 1.0 as the cost of a sequential page fetch.) SET variable name Internal variable Proposed default RANDOM_PAGE_COST random_page_cost 4.0 Cost of fetching a disk block nonsequentially (as a multiple of the cost of a sequential block fetch). CPU_TUPLE_COST cpu_tuple_cost 0.01 Cost of CPU time per tuple processed within a query (as a fraction of the cost of a sequential disk block fetch). This renames the existing SET variable COST_HEAP (cpu_page_weight); but the default value is smaller than it used to be, since WHERE clause evaluation will now be accounted for separately. CPU_INDEX_TUPLE_COST cpu_index_tuple_cost 0.001 Cost of CPU time per index tuple processed within a query (as a fraction of the cost of a sequential disk block fetch). This renames the existing SET variable COST_INDEX (cpu_index_page_weight); but the default value is much smaller than it used to be, since the operator evaluation cost will account for the bulk of the cost of visiting an index tuple. CPU_OPERATOR_COST cpu_operator_cost 0.0025 Cost of CPU time per operator or function evaluated in a WHERE clause. Note that this would apply to operators evaluated at index tuples as well as those evaluated against heap tuples. (The proposed default corresponds to a ratio of 5 microsec against 2 millisec for a sequential block fetch, which seems to be about right on my workstation.) ENABLE_SEQSCAN enable_seqscan ON ENABLE_INDEXSCAN enable_indexscan ON ENABLE_TIDSCAN enable_tidscan ON ENABLE_SORT enable_sort ON ENABLE_NESTLOOP enable_nestloop ON ENABLE_MERGEJOIN enable_mergejoin ON ENABLE_HASHJOIN enable_hashjoin ON Provide access via SET to the already-existing internal optimizer control flags. Currently, it is possible to have COST_HEAP and COST_INDEX set automatically during connection startup; libpq will do that if the environment variables PGCOSTHEAP and/or PGCOSTINDEX are defined on the client side. If we want to continue that behavior, the environment variables for these variables would be named PGRANDOMPAGECOST etc (remove underscores and prepend PG). I'm not sure if we want to continue inventing client-side environment variables, however. Comments? Ideas for better names? Anyone object to renaming the existing variables? (BTW, although it could be argued that this might break existing scripts that set COST_HEAP or COST_INDEX, I doubt that there are any ... and given the existing doco, I doubt even more that anyone is setting appropriate values ...) regards, tom lane
Looks great. I wouldn't change a thing in your proposal. > I am about to implement some changes to the planner/optimizer's cost > model, following up to the thread on pghackers beginning on 20 Jan. > The main conclusion of that thread was that we needed to charge more for > a page fetched nonsequentially than for a page fetched sequentially. > After further investigation I have concluded that it is also appropriate > to include explicit modeling of the cost of evaluation of WHERE clauses. > For example, using the regression database and a query like > -- Bruce Momjian | http://www.op.net/~candle pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
At 16:29 5/02/00 -0500, Tom Lane wrote: > >SET variable name Internal variable Proposed default > >RANDOM_PAGE_COST random_page_cost 4.0 > >Cost of fetching a disk block nonsequentially (as a multiple of the cost >of a sequential block fetch). > >CPU_TUPLE_COST cpu_tuple_cost 0.01 > > >CPU_INDEX_TUPLE_COST cpu_index_tuple_cost 0.001 > >CPU_OPERATOR_COST cpu_operator_cost 0.0025 > >ENABLE_SEQSCAN enable_seqscan ON > >ENABLE_INDEXSCAN enable_indexscan ON > >ENABLE_TIDSCAN enable_tidscan ON > >ENABLE_SORT enable_sort ON > >ENABLE_NESTLOOP enable_nestloop ON > >ENABLE_MERGEJOIN enable_mergejoin ON > >ENABLE_HASHJOIN enable_hashjoin ON > Any chance of prefixing the 'set' variable names with 'PG_' or 'PG_OPT_' or something similar? Or doing something else to differentiate them from user-declared SQL variables? I have no idea if user-declared SQL variables are an SQL92 thing, but these variables are 'system' things, and some kind of differentiation seems like a good idea. ---------------------------------------------------------------- Philip Warner | __---_____ Albatross Consulting Pty. Ltd. |----/ - \ (A.C.N. 008 659 498) | /(@) ______---_ Tel: +61-03-5367 7422 | _________ \ Fax: +61-03-5367 7430 | ___________ | Http://www.rhyme.com.au | / \| | --________-- PGP key available upon request, | / and from pgp5.ai.mit.edu:11371 |/
Philip Warner <pjw@rhyme.com.au> writes: > Any chance of prefixing the 'set' variable names with 'PG_' or 'PG_OPT_' or > something similar? Or doing something else to differentiate them from > user-declared SQL variables? I see no need to do that, since the *only* place these names exist is in the SET command (and its friends SHOW and RESET), and SET exists only to set system control variables. There are no user-declared SQL variables. The names are quite long and underscore-filled enough without adding unnecessary prefixes, IMHO ;-) regards, tom lane
At 18:31 5/02/00 -0500, Tom Lane wrote: >Philip Warner <pjw@rhyme.com.au> writes: >> Any chance of prefixing the 'set' variable names with 'PG_' or 'PG_OPT_' or >> something similar? Or doing something else to differentiate them from >> user-declared SQL variables? > >I see no need to do that, since the *only* place these names exist is >in the SET command (and its friends SHOW and RESET), and SET exists only >to set system control variables. There are no user-declared SQL >variables. > >The names are quite long and underscore-filled enough without adding >unnecessary prefixes, IMHO ;-) I agree, given their complexity, they are unlikely to conflict with future SQL names, but the SET statment *is* part of the SQL standard, and I thought it would be good to be cautious in the names you choose. This would avoid any possible future conflict, as well as make it clear from the outset that they are *not* standard SQL names. Another option would be to add another command, eg. 'PG', which is used for all non-SQLxx commands: PG SET somename = somevalue PG VACUUM ...etc. But this has the disctinct disadvantage of being more work, and being cumbersome in comparison to changing names. The transition could be managed by supporting old commands until version 8.0, with an appropriate notice. Just my 0.02c worth. ---------------------------------------------------------------- Philip Warner | __---_____ Albatross Consulting Pty. Ltd. |----/ - \ (A.C.N. 008 659 498) | /(@) ______---_ Tel: +61-03-5367 7422 | _________ \ Fax: +61-03-5367 7430 | ___________ | Http://www.rhyme.com.au | / \| | --________-- PGP key available upon request, | / and from pgp5.ai.mit.edu:11371 |/
Philip Warner wrote: > Another option would be to add another command, eg. 'PG', which is used for > all non-SQLxx commands: > > PG SET somename = somevalue > PG VACUUM > > ...etc. But this has the disctinct disadvantage of being more work, and > being cumbersome in comparison to changing names. This does not work out in terms of general SQL compatibility. Even if we treat commands after PG specially, no other SQL database would, and it would raise at least as many errors as the extension syntax. Nor is there any significant advantage of it within Postgres if we ever get a keyword clash with a future SQL revision - I'd rather not have a syntax that alows for two interpretations for the same keyword depending on whether it follows PG or not. Sevo -- Sevo Stille sevo@ip23.net