Thread: why do optimizer parameters have to be set manually?
Hi,

it seems to me that the optimizer parameters (like random_page_cost etc.) could easily be calculated and adjusted dynamically by the DB backend, based on the planner's cost estimates and actual run times for different queries. Perhaps the developers could comment on that?

I'm not sure how the parameters are used internally (apart from whatever "EXPLAIN" shows), but if cpu_operator_cost is the same for all operators, this should probably also be adjusted for individual operators (I suppose that ">" is not as costly as "~*").

As far as the static configuration is concerned, I'd be interested in other users' parameters and hardware configurations. Here's ours (for a write-intensive db that also performs many queries with regular expression matching):

  effective_cache_size = 1000000   # typically 8KB each
  #random_page_cost = 0.2          # units are one sequential page fetch cost
  random_page_cost = 3             # units are one sequential page fetch cost
  #cpu_tuple_cost = 0.01           # (same)
  cpu_index_tuple_cost = 0.01      # (same) 0.1
  #cpu_operator_cost = 0.0025      # (same)
  cpu_operator_cost = 0.025        # (same)

Other options:

  shared_buffers = 240000          # 2*max_connections, min 16, typically 8KB each
  max_fsm_relations = 10000        # min 10, fsm is free space map, ~40 bytes
  max_fsm_pages = 10000000         # min 1000, fsm is free space map, ~6 bytes
  #max_locks_per_transaction = 20  # min 10
  wal_buffers = 128                # min 4, typically 8KB each
  sort_mem = 800000                # min 64, size in KB
  vacuum_mem = 100000              # min 1024, size in KB
  checkpoint_segments = 80         # in logfile segments, min 1, 16MB each
  checkpoint_timeout = 300         # range 30-3600, in seconds
  commit_delay = 100000            # range 0-100000, in microseconds
  commit_siblings = 5              # range 1-1000

Hardware: 12GB RAM, dual 2.80GHz Xeon, 6x 10K rpm disks in a RAID-5, Linux 2.4.23 with HT enabled.

Regards,
Marinos
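For anyone who wants to experiment with these, the planner cost parameters can also be changed per-session, which makes it easy to watch how a plan reacts without editing postgresql.conf. A minimal sketch (the table, column and value here are purely illustrative):

  -- inspect the current setting
  SHOW random_page_cost;
  -- try a different value for this session only, then re-check the plan
  SET random_page_cost = 2;
  EXPLAIN SELECT * FROM some_table WHERE some_text_field ~* 'pattern';
  -- go back to the configured value
  RESET random_page_cost;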
"Marinos J. Yannikos" <mjy@geizhals.at> writes: > it seems to me that the optimizer parameters (like random_page_cost > etc.) could easily be calculated and adjusted dynamically be the DB > backend based on the planner's cost estimates and actual run times for > different queries. Perhaps the developers could comment on that? No, they are not that easy to determine. In particular I think the idea of automatically feeding back error measurements is hopeless, because you cannot tell which parameters are wrong. > I'm not sure how the parameters are used internally (apart from whatever > "EXPLAIN" shows), but if cpu_operator_cost is the same for all > operators, this should probably also be adjusted for individual > operators (I suppose that ">" is not as costly as "~*"). In theory perhaps, but in practice this is far down in the noise in most situations. regards, tom lane
mjy@geizhals.at ("Marinos J. Yannikos") writes:
> it seems to me that the optimizer parameters (like random_page_cost
> etc.) could easily be calculated and adjusted dynamically by the DB
> backend, based on the planner's cost estimates and actual run times
> for different queries. Perhaps the developers could comment on that?

Yes, it seems like a Small Matter Of Programming.

http://wombat.doc.ic.ac.uk/foldoc/foldoc.cgi?SMOP

In seriousness, yes, it would seem a reasonable idea to calculate some of these values a bit more dynamically. I would be inclined to start with something that ran a workload and provided static values based on how that workload went. That would require NO intervention inside the DB server; it could be accomplished simply by writing a database script. Feel free to contribute either a script or a backend "hack"...
-- 
let name="cbbrowne" and tld="libertyrms.info" in String.concat "@" [name;tld];;
<http://dev6.int.libertyrms.com/>
Christopher Browne
(416) 646 3304 x124 (land)
It appears that the optimizer only uses indexes for = clause?

Dave
> It appears that the optimizer only uses indexes for = clause?

The optimizer will use indexes for LIKE clauses, so long as the clause is a prefix search, eg:

  SELECT * FROM test WHERE a LIKE 'prf%';

Chris
after vacuum verbose analyze, I still get:

isp=# explain select * from isppm where item_upc_cd like '06038301234';
                               QUERY PLAN
-----------------------------------------------------------------------
 Seq Scan on isppm  (cost=100000000.00..100009684.89 rows=2 width=791)
   Filter: (item_upc_cd ~~ '06038301234'::text)
(2 rows)

isp=# explain select * from isppm where item_upc_cd = '06038301234';
                                QUERY PLAN
------------------------------------------------------------------------
 Index Scan using isppm_x0 on isppm  (cost=0.00..5.86 rows=2 width=791)
   Index Cond: (item_upc_cd = '06038301234'::bpchar)
(2 rows)

Dave

On Thu, 2003-12-18 at 20:38, Christopher Kings-Lynne wrote:
> > It appears that the optimizer only uses indexes for = clause?
>
> The optimizer will use indexes for LIKE clauses, so long as the clause
> is a prefix search, eg:
>
>   SELECT * FROM test WHERE a LIKE 'prf%';
>
> Chris
Re: is it possible to get the optimizer to use indexes with a like clause
From: Christopher Browne
pg@fastcrypt.com (Dave Cramer) wrote:
> It appears that the optimizer only uses indexes for = clause?

It can use indices only if there is a given prefix. Thus:

  where text_field like 'A%'

can use the index, essentially transforming this into the clauses:

  where text_field >= 'A' and text_field < 'B'

You can't get much out of an index for:

  where text_field like '%SOMETHING'
-- 
(reverse (concatenate 'string "moc.enworbbc" "@" "enworbbc"))
http://www3.sympatico.ca/cbbrowne/wp.html
"When the grammar checker identifies an error, it suggests a correction and can even makes some changes for you."
-- Microsoft Word for Windows 2.0 User's Guide, p.35
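This equivalence is easy to verify with EXPLAIN. A minimal sketch (made-up table and index names; per the rest of the thread, the plain-index version assumes a C-locale initdb):

  create table test (a text);
  create index test_a_idx on test (a);
  -- populate the table first; on a tiny table a seq scan wins anyway
  -- in C locale, both of these should yield equivalent index scans:
  explain select * from test where a like 'A%';
  explain select * from test where a >= 'A' and a < 'B';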
On Thu, 18 Dec 2003, Dave Cramer wrote:
> after vacuum verbose analyze, I still get:
>
> explain select * from isppm where item_upc_cd like '06038301234';
>                                QUERY PLAN
> -----------------------------------------------------------------------
>  Seq Scan on isppm  (cost=100000000.00..100009684.89 rows=2 width=791)
>    Filter: (item_upc_cd ~~ '06038301234'::text)
> (2 rows)

IIRC, the other limitation is that it only does so in "C" locale, due to weirdnesses in other locales.
Dave Cramer <pg@fastcrypt.com> writes:
> after vacuum verbose analyze, I still get [a seqscan]

The other gating factor is that you have to have initdb'd in C locale. Non-C locales tend to use wild and wooly sort orders that are not compatible with what LIKE needs.

regards, tom lane
So even in a North American locale, such as en_CA, this will be a problem?

Dave

On Thu, 2003-12-18 at 22:44, Tom Lane wrote:
> Dave Cramer <pg@fastcrypt.com> writes:
> > after vacuum verbose analyze, I still get [a seqscan]
>
> The other gating factor is that you have to have initdb'd in C locale.
> Non-C locales tend to use wild and wooly sort orders that are not
> compatible with what LIKE needs.
>
> regards, tom lane
Tom Lane wrote:
> No, they are not that easy to determine. In particular, I think the
> idea of automatically feeding back error measurements is hopeless,
> because you cannot tell which parameters are wrong.

Isn't it just a matter of solving an equation system with n variables (n being the number of parameters), where each equation stands for the calculation of the run time of a particular query? I.e. something like this for a sequential scan over 1000 rows with e.g. 2 operators used per iteration that took 2 seconds (simplified so that the costs are actual timings and not relative costs to a base value):

  1000 * sequential_scan_cost + 1000 * 2 * cpu_operator_cost = 2.0 seconds

With a sufficient number of equations (not just n, since not all query plans use all the parameters) this system can be solved for the particular query mix that was used. E.g. with a second sequential scan over 2000 rows with 1 operator per iteration that took 3 seconds, you can derive:

  sequential_scan_cost = 1 ms
  cpu_operator_cost = 0.5 ms

This could probably be implemented with very little overhead compared to the actual run times of the queries.

Regards,
Marinos
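Spelling out the arithmetic of that example: the two measurements give

  1000 * sequential_scan_cost + 2000 * cpu_operator_cost = 2.0 s
  2000 * sequential_scan_cost + 2000 * cpu_operator_cost = 3.0 s

Subtracting the first equation from the second leaves 1000 * sequential_scan_cost = 1.0 s, hence sequential_scan_cost = 1 ms; substituting back gives 2000 * cpu_operator_cost = 1.0 s, hence cpu_operator_cost = 0.5 ms.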
Dave Cramer <pg@fastcrypt.com> writes:
> So even in a North American locale, such as en_CA, this will be a
> problem?

If it's not "C" we won't try to optimize LIKE. I know en_US does not work (case-insensitive, funny rules about spaces, etc.) and I would expect en_CA has the same issues.

If you're using 7.4 you have the option to create a special index instead of re-initdb'ing your whole database.

regards, tom lane
"Marinos J. Yannikos" <mjy@geizhals.at> writes: > Tom Lane wrote: >> No, they are not that easy to determine. In particular I think the idea >> of automatically feeding back error measurements is hopeless, because >> you cannot tell which parameters are wrong. > Isn't it just a matter of solving an equation system with n variables (n > being the number of parameters), where each equation stands for the > calculation of the run time of a particular query? If we knew all the variables involved, it might be (though since the equations would be nonlinear, the solution would be more difficult than you suppose). The real problems are: 1. There is lots of noise in any real-world measurement, mostly due to competition from other processes. 2. There are effects we don't even try to model, such as the current contents of kernel cache. Everybody who's done any work with Postgres knows that for small-to-middling tables, running the same query twice in a row will yield considerably different runtimes, because the second time through all the data will be in kernel cache. But we don't have any useful way to model that in the optimizer, since we can't see what the kernel has in its buffers. 3. Even for the effects we do try to model, some of the equations are pretty ad-hoc and might not fit real data very well. (I have little confidence in the current correction for index order correlation, for example.) In short, if you just try to fit the present cost equations to real data, what you'll get will inevitably be "garbage in, garbage out". You could easily end up with parameter values that are much less realistic than the defaults. Over time we'll doubtless improve the optimizer's cost models, and someday we might get to a point where this wouldn't be a fool's errand, but I don't see it happening in the foreseeable future. I think a more profitable approach is to set up special test code to try to approximate the value of individual parameters measured in isolation. For instance, the current default of 4.0 for random_page_cost was developed through rather extensive testing a few years ago, and I think it's still a decent average value (for the case where you are actually doing I/O, mind you). But if your disks have particularly fast or slow seek times, maybe it's not good for you. It might be useful to package up a test program that repeats those measurements on particular systems --- though the problem of noisy measurements still applies. It is not easy or cheap to get a measurement that isn't skewed by kernel caching behavior. (You need a test file significantly larger than RAM, and even then you'd better repeat the measurement quite a few times to see how much noise there is in it.) regards, tom lane
Hello,
I got indexes to work with "text_pattern_ops" for locale et_EE.

So instead of:

  create index some_index_name on some_table(some_text_field);

or

  create index some_index_name on some_table(some_text_field text_ops);

try creating the index as follows:

  create index some_index_name on some_table(some_text_field text_pattern_ops);

Note that text_pattern_ops is available in pg >= 7.4.
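With such an index in place, a prefix LIKE should then be able to use it even under a non-C locale (same made-up names as above):

  explain select * from some_table where some_text_field like 'abc%';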
Regards,
Erki Kaldjärv
Webware OÜ
www.webware.ee
Tom Lane wrote:
> Dave Cramer <pg@fastcrypt.com> writes:
> > So even in a North American locale, such as en_CA, this will be a
> > problem?
>
> If it's not "C" we won't try to optimize LIKE. I know en_US does not
> work (case-insensitive, funny rules about spaces, etc.) and I would
> expect en_CA has the same issues. If you're using 7.4 you have the
> option to create a special index instead of re-initdb'ing your whole
> database.
>
> regards, tom lane
Tom Lane wrote:
> It is not easy or cheap to get a measurement that isn't skewed by
> kernel caching behavior. (You need a test file significantly larger
> than RAM, and even then you'd better repeat the measurement quite a
> few times to see how much noise there is in it.)

I found a really fast way in Linux to flush the kernel cache, and that is to unmount the drive and then remount it. Beats having to read through a file > RAM every time.
Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:
>> It appears that the optimizer only uses indexes for = clause?
>
> The optimizer will use indexes for LIKE clauses, so long as the
> clause is a prefix search, eg:
>
>   SELECT * FROM test WHERE a LIKE 'prf%';

Doesn't this still depend on your locale?

-Doug
Doug,

Yes, it does depend on the locale; you can get around this in 7.4 by building the index with smart operators.

Dave

On Thu, 2003-12-18 at 20:38, Doug McNaught wrote:
> Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:
>
> >> It appears that the optimizer only uses indexes for = clause?
> >
> > The optimizer will use indexes for LIKE clauses, so long as the
> > clause is a prefix search, eg:
> >
> >   SELECT * FROM test WHERE a LIKE 'prf%';
>
> Doesn't this still depend on your locale?
>
> -Doug